Talk:Development issues

Revision as of 20:23, 18 September 2006

Any plans to use any of the "low memory footprint" C libraries (uClibc, etc.)? --pdinoto

Don't think so: I18N and compatibility say otherwise. I think there are better places to go hunting. - jg
Odd. The Zaurus uses some of the "low memory footprint" libraries and it supports multiple scripts, including Japanese, English and Cyrillic. It is based on the Qt library, which has very good I18N and Unicode support. Recently, Trolltech has been selling a Qtopia Mobile Phone Edition that is being used on mobile phones in countries such as China.
If there is, in fact, an I18N issue with a "low memory footprint" library, then it should be investigated with a view to fixing it. Every byte we can trim from the base system is one more byte for educational content.

Will Java (SE, ME) be supported? I know Java is supposed to be a huge memory hog, but a carefully written app can consume as little as 16 MB together with the Java virtual machine. - wolf


Is there any discussion of using a microkernel, such as L4, GNU/Hurd, or Linux on L4? These aren't complete, but a high-profile project such as this could speed the development of any one of them. A small microkernel that is highly tuned to the individual CPU being used seems like it would increase performance and lower battery usage, which could be very good for a project like this.

Are you thinking about MINIX 3?

I understand that Python is the primary development language. However, will C/C++ be available as well? I have a character-based app that could be ported to OLPC.

CPU Issues

I've taken a look at the CPU and run some cachegrind simulations (on Rhythmbox playing Frank's 2000 Inch TV), making note of a few things.

For those of you wishing to play with cachegrind, you can do so on any old x86. Use I1 and D1 values of 16384,4,32 (size in bytes, associativity, line size in bytes) and set L2 to 64,4,32 (because cachegrind demands that there be an L2), then ignore the stats about the L2 cache.
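
For reference, the settings above could be turned into a cachegrind command line roughly like this. It is only a sketch: `cachegrind_argv` is a helper name made up here, `rhythmbox` stands in for whatever program is being profiled, and the spelling of the L2 option is an assumption that depends on the Valgrind release (older ones call it `--L2`, newer ones `--LL`).

```python
import shlex

# Cache geometry from the paragraph above, as "size,associativity,line size".
GEODE_L1 = "16384,4,32"
DUMMY_L2 = "64,4,32"  # only present because cachegrind insists on an L2; ignore its stats


def cachegrind_argv(program, l2_option="--L2"):
    """Build a valgrind command line that simulates the split 16 KiB L1 caches.

    l2_option is an assumption: older Valgrind releases spell it --L2,
    newer ones --LL. Check the cachegrind help output on your machine.
    """
    return [
        "valgrind",
        "--tool=cachegrind",
        f"--I1={GEODE_L1}",
        f"--D1={GEODE_L1}",
        f"{l2_option}={DUMMY_L2}",
        *program,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without valgrind.
    print(shlex.join(cachegrind_argv(["rhythmbox"])))
```

Printing the command rather than executing it keeps the sketch harmless; paste the output into a shell once the option names have been checked against the local Valgrind.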

  • There is NO L2 CACHE. An L1 miss costs 25 cycles.
    • Remember that sequential reads are much faster than scattered ones, because each cache line that is fetched serves the next several accesses.
    • There is an efficient prefetch mechanism on the Geode GX, and the compiler should use it by default. I don't know if we can help in any way (e.g. by calculating addresses and array indexes earlier? It probably reorders for this...)
  • L1 cache is 32KiB, with 16KiB I1 and 16KiB D1.
    • The D1 miss rate is 1.9%
    • The L1 miss rate total is 1.2%
    • Multiplying the miss rate 0.012 by the expense of 25 cycles gives 0.30 extra cycles per access, roughly a 30% slowdown due to the lack of L2 cache if a hit costs one cycle. There are very few ways to handle this, but one interesting idea is to rework the memory allocator (i.e. malloc()) to focus on cache locality.
      • Hoard is generally a good allocator but it may not be optimal here.
      • I hear FreeBSD's memory allocator focuses on cache locality but I haven't looked.
      • I'm writing my own allocator; I have a special scheme for small allocations that will hopefully improve cache locality. It can be implemented separately and dropped in straight over the existing malloc() to manage qualifying allocations, throwing the others back to the existing allocator. I'm working on getting an interposer working to see whether this helps.
  • There are only 8 ITLB, 8 DTLB, and 64 L2 TLB entries.
    • The ITLB can be helped if the linker locates functions that call each other and pulls them closer together.
      • I'm not sure how the linker rearranges functions; I'm guessing it doesn't take them out of separate objects and rearrange the code on such a scale. This kind of behavior would, however, allow it to pull functions that call each other into the same page and thus be friendly to the TLB.
    • -Os may be better or worse than -O2; -Os could result in more functions packed into a smaller area, saving cache and TLB entries.
    • The DTLB can be helped by using a better allocator. I'm looking into this; as with the cache, I'm going to take the TLB into account in the allocator design.
  • Python may be a slight stumbling block; but we can work on this a little.
    • Python scripts are compiled into bytecode.
      • We are aggravating the D1 problem because our 'code' is now data and affects that cache.
      • The Python interpreter should profile how it executes bytecode and rearrange it to be friendlier to cache.
      • Python modules and scripts written in Python (instead of C) could be pre-compiled to native.
      • The Psyco or PyPy JIT could be used, although start-up time and memory usage would increase.
        • The JIT could possibly be modified to cache its compiled output as position-independent code (PIC), allowing it to effectively generate pre-compiled Python on the fly. The problem with a JIT is that it can't share code memory. By writing the compiled native code back to a file and later mmap()ing it in, future runs could avoid JITing the code again, and parallel runs using the same chunks of code could use the copies in the same file, sharing the memory.
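
The miss-penalty estimate in the L1 cache bullets above can be reproduced with a few lines of arithmetic. This is a rough model, assuming every memory reference costs one cycle on a hit and 25 additional cycles on a miss; `extra_cycles_per_access` is a name made up for illustration.

```python
MISS_PENALTY_CYCLES = 25  # cost of an L1 miss with no L2 behind it (from the notes above)


def extra_cycles_per_access(miss_rate, penalty=MISS_PENALTY_CYCLES):
    """Average number of stall cycles added to each memory reference."""
    return miss_rate * penalty


if __name__ == "__main__":
    for label, rate in (("L1 total", 0.012), ("D1 only", 0.019)):
        extra = extra_cycles_per_access(rate)
        # Assumption: a hit costs one cycle, so extra cycles map directly to slowdown.
        print(f"{label}: {extra:.3f} extra cycles per access, about {extra:.1%} slower")
```

The model ignores everything else the CPU does while waiting, so treat the result as an upper-bound style estimate rather than a measurement.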

It may be fruitful to write a library for small allocations such as linked lists and start rewriting code to use it. This is more work but has much better potential. The advantage here is that programmers know better how they're going to allocate things, and can do a better job than a generic allocator. It would also be possible to use such a library to allocate linked lists and then exchange nodes in place so that the list is kept sequential in memory, intentionally keeping it together in cache; this may be a win in some places and a loss in others.
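
To make the linked-list idea a bit more concrete, here is a toy model of it. It is not the library itself: `PooledList` and its methods are invented for this sketch, Python lists stand in for a contiguous block of raw memory, and `compact()` rebuilds the pool in traversal order instead of swapping nodes in place, which is a simplification.

```python
NIL = -1  # marks "no next node"


class PooledList:
    """Singly linked list whose nodes all live in one pool, addressed by slot index."""

    def __init__(self):
        self.values = []  # payload of each node, indexed by slot
        self.next = []    # slot of the following node, or NIL
        self.head = NIL

    def push_front(self, value):
        # New nodes always take the next free slot at the end of the pool.
        self.values.append(value)
        self.next.append(self.head)
        self.head = len(self.values) - 1

    def slots_in_traversal_order(self):
        order, slot = [], self.head
        while slot != NIL:
            order.append(slot)
            slot = self.next[slot]
        return order

    def __iter__(self):
        return (self.values[slot] for slot in self.slots_in_traversal_order())

    def compact(self):
        """Move nodes so that walking the list touches slots 0, 1, 2, ... in order."""
        order = self.slots_in_traversal_order()
        self.values = [self.values[slot] for slot in order]
        # After the move, node i is followed by node i + 1; the last one ends the list.
        self.next = list(range(1, len(order))) + ([NIL] if order else [])
        self.head = 0 if order else NIL


if __name__ == "__main__":
    lst = PooledList()
    for v in (3, 2, 1):
        lst.push_front(v)
    print(lst.slots_in_traversal_order())  # walk order jumps backwards through the pool
    lst.compact()
    print(lst.slots_in_traversal_order())  # walk order now matches memory order
```

In a real C version the slots would be fixed-size records in one allocation, so a compacted list would be walked front to back through consecutive cache lines. The cost is that any outside reference to a node by slot goes stale after compaction, which is one way this could turn into a loss.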

Memory Issues

OLPC has 128 MB of memory, so we have to be quite careful with what goes on. Here are a few thoughts:

  • Memory Compression
    • Nitin Gupta is working on this at http://linuxcompressed.sourceforge.net now.
    • There is an Ubuntu Specification for Compressed Memory using Nitin's patches.
    • Nitin is willing to continue development and possibly pursue mainline, but he needs to get money from somewhere; he may approach Ubuntu for funding, but if there's money and programmers to spare, it may be advantageous to approach him with some cash.
  • Efficient memory allocator
    • Look into other memory allocators like the one in FreeBSD or Hoard.
    • I'm working on my own. I'm not sure how it will perform, but the original intent was to improve space efficiency.
  • Pre-compile Python scripts so that they can be mmap()ed into memory.
    • Bytecode-compiled Python takes up memory separately in every process, and it takes more time to run.
    • Just-in-Time compiled Python (Psyco) allocates separate code memory in each process that JITs the code.
    • Unmodified file-backed mmap() areas are shared between processes; pre-compiled Python code would use this method and save a few pages.
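
The mmap() step in the last bullets could look something like the following. This is a sketch under assumptions: `write_cache` and `map_cache` are hypothetical helpers, the file name is invented, and the blob is placeholder bytes rather than real compiled code. It only shows the mapping step; whether the pages actually end up shared between processes is up to the kernel.

```python
import mmap
import os
import tempfile


def write_cache(path, blob):
    """Store already-compiled code (here: opaque placeholder bytes) in a cache file."""
    with open(path, "wb") as f:
        f.write(blob)


def map_cache(path):
    """Map the cache file read-only, so the pages stay backed by the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping remains usable after f is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "module.cache")  # hypothetical file name
        write_cache(path, b"\x90" * 4096)         # placeholder, not real compiled code
        first = map_cache(path)
        second = map_cache(path)                  # stands in for a second process
        print(first[:4] == second[:4])
        first.close()
        second.close()
```

Because the mapping is read-only and never modified, nothing forces a process to take a private copy of the pages, which is the property the bullets above rely on.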