The Cell Broadband Engine in the PS/3 is a heterogeneous multi-core processor. There is a general-purpose, dual-threaded PowerPC core called the PPU (also called the PPE if you include the caches), and there is the SPU (called the SPE when its local store and memory flow controller are included). This means that your application needs to target two processor architectures, so you will need separate compiler toolchains for the PPU and the SPU. Normally, the Cell chip contains 8 SPU cores. In the PS/3, however, one core is disabled to increase the yield of the manufacturing process. When running GNU/Linux on the PS/3, an additional core is unavailable because it is used to run the hypervisor GameOS from Sony, which leaves six SPUs for your application.
The SPU is a SIMD architecture that operates on 128-bit data. It is where the real performance of the PS/3 is. The SPU has a small memory of only 256 KB, called the local store. The PPU has 256 MB of main memory. To run anything on the SPU cores, the PPU has to load a program and execute it on the SPU. Control over the SPU cores by the PPU is provided by the libspe2 library. The local store is not shared memory, so DMA operations are required to transfer data to or from the PPU's memory. It is also possible to do DMA between the local stores of two SPUs. The DMA is typically initiated by the SPU, but PPU-initiated DMA is also possible.
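To make the DMA step a bit more concrete, here is a minimal host-side sketch of pulling a large buffer into a local buffer piece by piece. It assumes the commonly documented MFC limits (at most 16 KB per request, 16-byte aligned addresses and sizes for bulk transfers). `fake_dma_get` and `chunked_get` are hypothetical names, and memcpy stands in for the real transfer so that the sketch runs on any machine; real SPU code would issue `mfc_get` from `spu_mfcio.h` and wait on a tag group instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_MAX_BYTES 16384u  /* assumed per-request MFC limit (16 KB) */
#define DMA_ALIGN     16u     /* assumed alignment for bulk transfers */

/* Hypothetical stand-in for an SPU-side DMA "get"; plain memcpy here. */
void fake_dma_get(void *local, const void *remote, size_t size)
{
    assert(size > 0 && size <= DMA_MAX_BYTES);
    assert(size % DMA_ALIGN == 0);
    assert((uintptr_t)local % DMA_ALIGN == 0);
    assert((uintptr_t)remote % DMA_ALIGN == 0);
    memcpy(local, remote, size);
}

/* Copy `size` bytes in chunks no larger than DMA_MAX_BYTES.
 * Returns the number of DMA requests that were issued. */
unsigned chunked_get(void *local, const void *remote, size_t size)
{
    unsigned requests = 0;
    unsigned char *dst = local;
    const unsigned char *src = remote;

    assert(size % DMA_ALIGN == 0);
    while (size > 0) {
        size_t n = size < DMA_MAX_BYTES ? size : DMA_MAX_BYTES;
        fake_dma_get(dst, src, n);
        dst += n;
        src += n;
        size -= n;
        requests++;
    }
    return requests;
}
```

An array of 512 four-float vectors, as used in the example further down, is 8 KB and would fit in a single request under these assumptions. Anything bigger has to be split, and the whole working set still has to fit in the 256 KB local store next to the SPU program itself.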
When the SPU initiates a DMA transfer, it needs to know the 'effective addresses' of the target buffer in the PPU memory. These effective addresses can be tricky to obtain, but as a convenience, IBM provides a linking technology called CESOF linking. Using CESOF, the PPU and SPU can share a symbol, and use DMA to update the SPU's copy from the PPU's copy and back.
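The idea behind the shared symbol can be mocked up on a single machine. The sketch below is not CESOF itself: it only imitates the `_EAR_`/`_LOCAL_` naming convention that the SPU code in this post uses, with memcpy standing in for the DMA transfers. The macro bodies, `mock_link` and `spu_side_work` are my own assumptions for illustration, not the SDK's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* "PPU side": the object living in main memory. */
float vectors[8];

/* "SPU side": a 64-bit effective address plus a local-store copy,
 * following the _EAR_ / _LOCAL_ naming of the CESOF example. */
unsigned long long _EAR_vectors;
float _LOCAL_vectors[8];

/* Hypothetical mock-ups of UPDATE_LOCAL / UPDATE_REMOTE;
 * memcpy stands in for the DMA transfers. */
#define UPDATE_LOCAL(sym) \
    memcpy(_LOCAL_##sym, (void *)(uintptr_t)_EAR_##sym, sizeof _LOCAL_##sym)
#define UPDATE_REMOTE(sym) \
    memcpy((void *)(uintptr_t)_EAR_##sym, _LOCAL_##sym, sizeof _LOCAL_##sym)

/* Stands in for the linking step: resolve the effective address. */
void mock_link(void)
{
    _EAR_vectors = (unsigned long long)(uintptr_t)vectors;
}

/* Round trip: fetch, double every element locally, write back. */
void spu_side_work(void)
{
    assert(_EAR_vectors != 0);
    UPDATE_LOCAL(vectors);
    for (int i = 0; i < 8; i++)
        _LOCAL_vectors[i] *= 2.0f;
    UPDATE_REMOTE(vectors);
}
```

The point of the pattern is that the SPU code never computes an address itself. It only names the symbol, and the effective address arrives through the `_EAR_` variable.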
The SPU is very sensitive to stalls, which are often induced by data dependencies. To investigate the performance statically, spu-timing can be invoked on an SPU assembly (.s) file. If the dynamic behavior, such as branching, needs to be investigated as well, the Cell simulator needs to run and analyze your code. You can save yourself quite some performance tuning by adopting the application-specific libraries that are provided with IBM's SDK. There are all sorts of libraries available with streamlined implementations for things like audio, image, math and 3D. These reside in /opt/ibm/cell-sdk/prototype/src/lib/
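A common way to take the sting out of data dependencies is to keep several independent accumulators, so that one add does not have to wait for the previous one. Here is a minimal scalar sketch of that idea in portable C. The function names are made up for illustration, and any actual speedup depends on the compiler and the core, so measure before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous one. */
float sum_serial(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Four accumulators: four independent dependency chains that a
 * pipelined core can overlap. Requires n to be a multiple of 4. */
float sum_unrolled4(const float *data, size_t n)
{
    float t0 = 0.0f, t1 = 0.0f, t2 = 0.0f, t3 = 0.0f;

    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        t0 += data[i + 0];
        t1 += data[i + 1];
        t2 += data[i + 2];
        t3 += data[i + 3];
    }
    /* Combine the partial sums only once, at the very end. */
    return (t0 + t1) + (t2 + t3);
}
```

Note that reordering floating point additions can change the rounding in general. For small integer-valued inputs, like the 0..511 values used in the example in this post, both versions give exactly the same answer.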
All the SDK examples still use old-style libspe version 1 code. I've modified the cesof_simple.c example to make use of libspe2. Below is the modified code. This is an example of how to do proper development using a modern userspace library, and using CESOF to share symbols between the SPU and the PPU.
// shared_data.h
// defines a vector of size 4, to be used by both ppe and spe
typedef union
{
  float vals[4];
  vector float vect;
} vec_t;
// cesof_simple.c
// modified by Bram Stolk for use with libspe2
#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>
#include "shared_data.h"

/* allocate memory objects located in the effective address space */
vec_t vectors[512] __attribute__((aligned(128)));

/*
 * This is the spe program handle structure.
 * We have this already in-core due to cesof linking.
 * If we did not embed the SPU code, we would have to load it using
 * spe_image_open() instead.
 */
extern spe_program_handle_t spu_foobar;

int main()
{
  int i, status;

  /* initialize the memory objects */
  for (i=0; i<512; i++)
    vectors[i].vect = (vector float){i,i,i,0};

  unsigned int createflags = 0;
  spe_context_ptr_t spe = spe_context_create(createflags, NULL);
  if (!spe)
  {
    /* without a context there is nothing to load or run */
    perror("spe_context_create failed");
    return EXIT_FAILURE;
  }

  status = spe_program_load(spe, &spu_foobar);
  if (status)
    perror("spe_program_load failed");

  unsigned int runflags = 0;
  unsigned int entry = SPE_DEFAULT_ENTRY;
  void * argp = NULL;
  void * envp = NULL;
  /* blocks until the SPU program has finished */
  status = spe_context_run(spe, &entry, runflags, argp, envp, NULL);
  if (status < 0)
    perror("spe_context_run failed");

  fprintf(stderr,"Result = %f %f %f %f\n",
          vectors[0].vals[0], vectors[0].vals[1],
          vectors[0].vals[2], vectors[0].vals[3]);

  status = spe_context_destroy(spe);
  if (status)
    perror("spe_context_destroy failed");

  return 0;
}
// spu_foobar.c
#include <stdio.h>
#include <spu_intrinsics.h>
#include "decl.h"
#include "../shared_data.h"

/* reference the memory objects in the effective address space */
/* e.g. for ea memory object "foo[512]" */
/* extern unsigned long long _EAR_foo; */
/* int _LOCAL_foo[512]; */
/* EXTERN_EAR expands to the example above */
#define EXTERN_EAR(sym, type, size) \
  extern unsigned long long _EAR_##sym; \
  type _LOCAL_##sym[size] __attribute__((aligned(128)));

EXTERN_EAR(vectors, vec_t, 512);

int main (long long spuid __attribute__ ((__unused__)),
          char** argp __attribute__ ((__unused__)),
          char** envp __attribute__ ((__unused__)))
{
  int i;

  printf("effective address of vectors is %llx\n", _EAR_vectors);

  /* Transfer from system memory to local store using DMA */
  UPDATE_LOCAL(vectors);

  /* local operation */
  vec_t totals[4];
  totals[0].vect = (vector float){0,0,0,0};
  totals[1].vect = (vector float){0,0,0,0};
  totals[2].vect = (vector float){0,0,0,0};
  totals[3].vect = (vector float){0,0,0,0};
  for (i=0; i<512; i+=4)
  {
    // Do 4 vector adds per iteration, so that we have a lot less processor stalls.
    totals[0].vect = spu_add(totals[0].vect, _LOCAL_vectors[i+0].vect);
    totals[1].vect = spu_add(totals[1].vect, _LOCAL_vectors[i+1].vect);
    totals[2].vect = spu_add(totals[2].vect, _LOCAL_vectors[i+2].vect);
    totals[3].vect = spu_add(totals[3].vect, _LOCAL_vectors[i+3].vect);
  }

  /* combine the four partial sums into element 0 */
  _LOCAL_vectors[0].vect = (vector float)
  {
    totals[0].vals[0]+totals[1].vals[0]+totals[2].vals[0]+totals[3].vals[0],
    totals[0].vals[1]+totals[1].vals[1]+totals[2].vals[1]+totals[3].vals[1],
    totals[0].vals[2]+totals[1].vals[2]+totals[2].vals[2]+totals[3].vals[2],
    totals[0].vals[3]+totals[1].vals[3]+totals[2].vals[3]+totals[3].vals[3],
  };

  /* Transfer from local store to system memory using DMA */
  UPDATE_REMOTE(vectors);

  printf("Done!\n");
  return 0;
}
.section .toe, "a", @nobits
.align 4
.global _EAR_vectors
_EAR_vectors:
.octa 0x0

In 2010, Sony dropped its bomb on OtherOS, so I thought I would update. Here is a complete guide to the 0.15 Tflop/s hello world.