mersenneforum.org > Extra Stuff > Programming
2009-07-13, 13:17   #12
jasonp

Hmm, maybe NFS linear algebra can benefit from this (and pstach's CADO presentation).
2009-07-25, 01:36   #13
SPWorley

Quote:
Originally Posted by frmky
I've written just a few simple programs. Besides the obvious things like having lots of threads to mask latencies and accessing global memory as little as possible, getting the best performance really seems to be trial and error. I was involved in the integration of distributed.net's RC5 CUDA core. I found that seemingly trivial changes, like changing an always-positive integer from signed to unsigned, or assigning blockDim.x, blockIdx.x, and threadIdx.x to an integer before doing ANY math, made huge differences in the speed of the code. In fact, just the move to CUDA 2.2 has halved the speed of the RC5 code. I recently asked why on the nVidia forums but haven't gotten any replies.
Lots of possible reasons. I just replied to you on that forum, but admittedly I haven't tried compiling the code yet, especially since I don't have 2.0 around anymore.
2009-10-17, 18:47   #14
fivemack

I can get a cheap factor-of-5 speed-up for the program above by splitting the primes into small primes (the first eight) and large primes (all the rest). Large primes are handled with the Sieve kernel above, one prime per thread; small primes are handled one chunk of memory per thread by

Code:
__global__ static void SplitSieve(char* sieve, int prime, int zap, int sieve_size)
{
  // Launched as <<<zap, 1>>>: thread idx zaps prime*(prime+idx),
  // prime*(prime+idx+zap), ...; together the zap threads cover every
  // multiple of prime from prime*prime upward.
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i=prime*(prime+idx); i<sieve_size; i+=prime*zap)
    sieve[i]=0;
}

for (int r=0; r<nsmall; r++)
{
  SplitSieve<<<400, 1, 0>>>(device_big_sieve, primes[r], 400, N);
  cudaThreadSynchronize();
}
This is still desperately inefficient; I'm doing half a billion one-byte writes, when it ought to be possible to do something coalesced for the small primes. I assume I can't use a texture 179 units wide and let the texture-fetch unit do my modulo operations for free.

I'm a bit confused about coalesced operations: how do I tell which thread in the warp I am? Is it threadIdx.x % 32, or does it depend on the launch parameters? Is this even a meaningful question?

Atomic operations on the GTX275 are supposedly fairly fast; packing eight odd numbers per byte should allow very quick bulk writing for the small primes, and then atomicAnd to zap bits one at a time for the large primes.
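
Something like this is what I have in mind for the large primes: an untested sketch, packing one odd number per bit (atomicAnd is the CUDA intrinsic; it works on 32-bit words in global memory on compute 1.1+):

Code:
// Sketch only: bit i of the packed sieve represents the odd number 2*i+1.
// atomicAnd clears one bit without racing against other threads that hit
// the same 32-bit word.
__global__ static void SieveBits(unsigned int* bitsieve, int* primes,
                                 int maxp, unsigned int nbits)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if ((idx > 0) && (idx < maxp))
  {
    int p = primes[idx];
    for (unsigned int i = (3*p) >> 1; i < nbits; i += p)
      atomicAnd(&bitsieve[i >> 5], ~(1u << (i & 31)));
  }
}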

I'm noticing that small black and white spots appear and disappear on my desktop, not only while I'm running CUDA jobs but afterwards (it may be connected to a window well behind the one I'm typing in scrolling upwards); presumably this is a Linux driver issue triggered by CUDA doing something odd in main memory.
2009-10-19, 09:57   #15
fivemack

The texture-mapping approach works nicely; I can fill 200MB with the pattern with primes < 30 removed in about 6ms.
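
The core of it is just tiling a precomputed byte pattern across the big array with coalesced writes; a rough, untested sketch (pattern length and launch shape are illustrative):

Code:
// Tile a pattern of odd numbers, with the small primes already crossed
// off, across the big sieve. Consecutive threads write consecutive bytes,
// so the stores coalesce; the repeated pattern reads are what the texture
// cache is good at.
__global__ static void TilePattern(unsigned char* sieve, const unsigned char* pattern,
                                   int pat_len, unsigned int sieve_size)
{
  unsigned int stride = gridDim.x * blockDim.x;
  for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < sieve_size; i += stride)
    sieve[i] = pattern[i % pat_len];
}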

Things that bit me:

it's kernel<<<NUMBER OF BLOCKS IN GRID, THREADS PER BLOCK>>>, not the other way round

you want to index on threadIdx.x (not on gidx = blockDim.x*blockIdx.x+threadIdx.x) when filling shared memory at the start of a kernel

blockDim.x is the number of threads in the block; I can't figure out how to get at the number of blocks in the grid from the kernel, but obviously you can pass it as a parameter.

loads from global memory are slow and not cached at all; shared memory is fast. I haven't used texture memory yet.

you can't just fill the shared memory when blockIdx.x==0 and trust that it's still there the next time through the grid

<<<30,256>>> is a good configuration on the GTX275 for texture-mapping-like jobs.

Things to try: texture memory, atomic bitwise operations for the larger primes.
2009-10-19, 15:00   #16
frmky

Quote:
Originally Posted by fivemack
it's kernel<<<NUMBER OF BLOCKS IN GRID, THREADS PER BLOCK>>>, not the other way round
Should have warned you when I discovered that. Sorry.
Quote:
Originally Posted by fivemack
you want to index on threadIdx.x (not on gidx = blockDim.x*blockIdx.x+threadIdx.x) when filling shared memory at the start of a kernel

you can't just fill the shared memory when blockIdx.x==0 and trust that it's still there the next time through the grid
I too have trouble remembering that shared memory is shared within, and persists for the lifetime of, a block, not an entire grid.
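
The rule in a nutshell, as a made-up sketch: every block has to fill its own copy before any thread reads it, and nothing survives into the next block or the next launch.

Code:
__global__ void blockLocal(const int* in, int* out, int n)
{
  __shared__ int tile[256];              // one fresh copy per block
  int g = blockIdx.x * blockDim.x + threadIdx.x;
  if (g < n) tile[threadIdx.x] = in[g];  // every block fills its own tile
  __syncthreads();                       // before any thread reads it
  if (g < n)
    out[g] = tile[threadIdx.x] + (threadIdx.x ? tile[threadIdx.x - 1] : 0);
}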
Quote:
Originally Posted by fivemack
blockDim.x is the number of threads in the block; I can't figure out how to get at the number of blocks in the grid from the kernel, but obviously you can pass it as a parameter.
gridDim.x (and gridDim.y if 2D)
Quote:
Originally Posted by fivemack
loads from global memory are slow and not cached at all; shared memory is fast. I haven't used texture memory yet.
I've never used it, but supposedly writes behave the same as ordinary global-memory writes, while reads, even somewhat random ones, are fast as long as the data being read is nearby.
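
For what it's worth, the 1-D texture path looks roughly like this (untested sketch using the texture-reference API of this CUDA generation; the names are made up):

Code:
texture<unsigned char, 1, cudaReadModeElementType> patternTex;
// host side, after cudaMalloc/cudaMemcpy of dev_pattern:
//   cudaBindTexture(NULL, patternTex, dev_pattern, pattern_bytes);
__global__ void readViaTexture(unsigned char* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = tex1Dfetch(patternTex, i);  // read through the texture cache
}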
Quote:
Originally Posted by fivemack
<<<30,256>>> is a good configuration on the GTX275 for texture-mapping-like jobs.
The grid size needs to be a multiple of the number of MPs, and, as I've just learned from Jason, a multiplier of 1 works well.
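
You can also query the MP count at runtime instead of hard-coding 30:

Code:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int nblocks = prop.multiProcessorCount;  // 30 on a GTX 275
// kernel<<<nblocks, 256>>>(...);  -- same as the <<<30,256>>> above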
Quote:
Originally Posted by fivemack
Things to try: texture memory, atomic bitwise operations for the larger primes.
Good luck!
2009-11-10, 16:24   #17
__HRB__

Need a dumb program

There is a way to do fast multiprecision multiplies using a SIMD-in-a-register technique, but runtime would be something like 2x-4x that of a floating-point FFT.

Since the algorithm uses only AND, OR, XOR, NAND, shuffles and shifts, it might have an advantage if one considers energy consumption.

I don't have an NVIDIA card, but maybe someone is willing to code and run two simple (and stupid) test programs to compute

Code:
(watts(program 1)-watts(idle))/(watts(program 2)-watts(idle))
preferably by measuring power consumption directly at the power supply. If that isn't possible, maybe one can get an estimate by setting the fan speed on the card constant and computing

Code:
(maxtemp(program 1)-maxtemp(idle))/(maxtemp(program 2)-maxtemp(idle))
by measuring the GPU temperature instead.

Program 1: fill an array 'x' of int with 0xFFFFFFFF, and repeatedly compute y=y xor x.

Program 2: fill the array 'x' of float with -1/3, and repeatedly compute
y=x + y*y, using a fused multiply-accumulate instruction.

Arrays should be large enough to maximize throughput.
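
In case it helps, here is roughly what I mean, as untested CUDA sketches; array sizes, iteration counts and launch shapes are whatever maximizes throughput, and __fmaf_rn is the single-precision fused multiply-add intrinsic:

Code:
// Program 1: pure integer XOR workload.
__global__ void prog1(unsigned int* y, const unsigned int* x, int n, int iters)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    unsigned int acc = y[i], v = x[i];  // v == 0xFFFFFFFF
    for (int k = 0; k < iters; k++)
      acc ^= v;                         // y = y xor x (a real harness must
    y[i] = acc;                         // keep the compiler from folding this)
  }
}

// Program 2: pure floating-point FMA workload, y = x + y*y.
__global__ void prog2(float* y, const float* x, int n, int iters)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float acc = y[i], v = x[i];         // v == -1/3
    for (int k = 0; k < iters; k++)
      acc = __fmaf_rn(acc, acc, v);     // fused y*y + x
    y[i] = acc;
  }
}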
2011-03-05, 18:04   #18
alexhiggins732

Quote:
Originally Posted by fivemack
I can get a cheap factor-of-5 speed-up for the program above by splitting the primes into small primes (the first eight) and large primes (all the rest). Large primes are handled with the Sieve kernel above, one prime per thread; small primes are handled one chunk of memory per thread by

Code:
__global__ static void SplitSieve(char* sieve, int prime, int zap, int sieve_size)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i=prime*(prime+idx); i<sieve_size; i+=prime*zap)
    sieve[i]=0;
}

for (int r=0; r<nsmall; r++)
{
          SplitSieve<<<400, 1, 0>>>(device_big_sieve,primes[r],400,N);
          cudaThreadSynchronize();
}
Launching many kernels introduces overhead.

Quote:
Originally Posted by fivemack

This is still desperately inefficient; I'm doing half a billion one-byte writes, when it ought to be possible to do something coalesced for the small primes.
A quick look at this shows that a major cause of the slowdown is thread starvation. The code above was running something like 3 threads per block, with each thread sieving a single prime over the entire sieve. Just to overcome global memory latency you need 50% occupancy.

I was able to get a large speed-up by assigning each block to a prime and having each thread sieve a partition of the entire sieve.

Instead of

Code:
__global__ static void Sieve(u8 * sieve,int * primes, int maxp, unsigned int sieve_size)
{ 
  int idx = blockIdx.x * blockDim.x + threadIdx.x; 
  if ((idx>0) && (idx < maxp))
    {
      int ithPrime = primes[idx];
      for(unsigned int i=(3*ithPrime)>>1; i < sieve_size; i+=ithPrime)
        // (3*ithPrime)>>1 == ithPrime + (ithPrime-1)/2, the index of
        // 3*ithPrime in this odds-only sieve
        sieve[i] = 1;
    }
}
Use this
Code:
__global__ static void Sieve2(u8 * sieve, int * primes, int maxp, unsigned int sieve_size)
{
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int threads_per_block = blockDim.x;
    int block_size = sieve_size / threads_per_block; // default: 134217728/256 = 524288
    if ((bx > 0) && (bx < maxp))
    {
        int p = primes[bx];
        unsigned int i;
        int lim;
        if (tx == 0)
        {
            i = (3*p) >> 1; // index of 3*p, the first odd multiple of p past p itself
            lim = block_size;
        }
        else
        {
            i = block_size * tx;   // index of the start of this thread's partition
            lim = i + block_size;

            i <<= 1;
            i -= 1;   // the odd number just below the partition start
            i /= p;   // largest k with k*p < start
            i *= p;   // largest multiple of p below start
            i += p;   // smallest multiple of p >= start
            if ((i & 1) == 0) i += p; // keep it odd (this should never happen)
            i -= 1;
            i >>= 1;  // back to an index in the odds-only sieve
        }
        if (lim > sieve_size)
            lim = sieve_size; // make sure lim <= sieve_size so we don't run past the array
        for (; i < lim; i += p)
            sieve[i] = 1;
    }
}
It sieves up to 2^30 in 2 seconds on a GTX 260. YAFU takes 2.4375 seconds to sieve the same range.

By compacting the bytes into a bit array, simply by changing sieve[i] = 1; to sieve[i>>3] |= 1 << (i&7);, the sieve takes only ~0.6 seconds up to 2^30. However, this introduces a race condition which produces inaccurate results.
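
The standard cure for that race is to make the read-modify-write atomic on 32-bit words rather than bytes; a sketch (atomicOr is the CUDA intrinsic, and whether it is still a win needs measuring):

Code:
// racy:   sieve[i>>3] |= 1 << (i & 7);
// atomic, viewing the sieve as 32-bit words (cudaMalloc alignment is fine):
atomicOr(&((unsigned int*)sieve)[i >> 5], 1u << (i & 31));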

Still, this is a very unoptimized implementation which runs on par with well-optimized versions of the CPU sieve, so maybe this shouldn't be written off as un-doable yet.

I think the best approach may be to have all primes sieve a partition of the sieve in shared memory, using the bit-array approach, and then copy the shared memory to global memory in a bank-conflict-free and coalesced manner (a rough sketch is at the end of this post). Playing around with that approach, I can achieve a computation time of 0.2 seconds for up to 2^32 and a similar time writing to global memory. That gives me hope that a range of 2^32 should be sievable in ~0.5 seconds.

However, attempts at combining the two end up with race conditions and bank conflicts which hit performance dramatically, but I still have a few more ideas to try.
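
For reference, a minimal untested sketch of that shared-memory idea, with made-up sizes (atomicOr on shared memory needs compute 1.2+, which the GTX 260 has):

Code:
#define WORDS 1024  // 4 KB of shared memory = 32768 bits per block

__global__ static void BitSieveChunk(unsigned int* gsieve, const int* primes, int np)
{
  __shared__ unsigned int bits[WORDS];
  unsigned int base = blockIdx.x * (WORDS * 32);  // first number in this chunk
  for (int w = threadIdx.x; w < WORDS; w += blockDim.x)
    bits[w] = 0;
  __syncthreads();
  // primes strided across threads; shared-memory atomicOr avoids the race
  for (int j = threadIdx.x; j < np; j += blockDim.x) {
    unsigned int p = primes[j];
    unsigned int i = (base + p - 1) / p * p;      // first multiple of p >= base
    if (i < 2*p) i = 2*p;                         // don't cross off p itself
    for (; i < base + WORDS * 32; i += p) {
      unsigned int o = i - base;
      atomicOr(&bits[o >> 5], 1u << (o & 31));
    }
  }
  __syncthreads();
  // one word per thread: coalesced, bank-conflict-free copy to global memory
  for (int w = threadIdx.x; w < WORDS; w += blockDim.x)
    gsieve[blockIdx.x * WORDS + w] = bits[w];
}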
2012-03-06, 14:30   #19
ET_

Quote:
Originally Posted by alexhiggins732
I think the best approach may be to have all primes sieve a partition of the sieve in shared memory, using the bit-array approach, and then copy the shared memory to global memory in a bank-conflict-free and coalesced manner. Playing around with that approach, I can achieve a computation time of 0.2 seconds for up to 2^32 and a similar time writing to global memory. That gives me hope that a range of 2^32 should be sievable in ~0.5 seconds.

However, attempts at combining the two end up with race conditions and bank conflicts which hit performance dramatically, but I still have a few more ideas to try.
Any news?

Luigi
2012-06-26, 18:51   #20
bsquared

I recently discovered that one of the servers at work has a Tesla M2050 running CUDA 3.2, so I thought I'd make my first foray into GPU programming by trying out this problem.

Below is a segmented sieve that counts the primes in the 10^8 interval in 22 milliseconds, or about 200x faster (not quite 11k faster, but still not so shabby).

A 2^30 range takes 250 milliseconds, which is about twice as fast as YAFU's single-threaded CPU version on an X5680. So far, though, a multi-threaded CPU will still beat the GPU. But I have a bunch more optimizations left to try on the GPU (bit packing, additional "spokes" in the wheel sieve, coalesced writes for the smallest primes, etc.).

Code:
#include <stdio.h>
#include <cutil_inline.h>
typedef unsigned char u8;
typedef unsigned int uint32;
typedef unsigned char uint8;
typedef long long unsigned int uint64;
int maxT = 0;
const int threadsPerBlock = 256;
const uint32 block_size = 8192;
__global__ static void SegSieve(uint32 *count, uint32 * primes, int maxp, int nump, uint32 N)
{ 
 uint32 j,k;
 uint32 bid = blockIdx.y * gridDim.x + blockIdx.x; 
 uint32 range = block_size / threadsPerBlock;
 __shared__ u8 locsieve[block_size];
 // everyone init the array.
 for (j=threadIdx.x * range, k=0; k<range; k++)
 {
  // we're counting hits in the kernel as well, so clear the bytes representing primes > N
  if ((2*(bid * block_size + j + k)+1) < N)
   locsieve[j+k] = 1;
  else
   locsieve[j+k] = 0;
 }
 
 // regroup before sieving
 __syncthreads();
 
 // now sieve the array
 for (j=0; j<nump; j++)
 {
  // + 1 because we don't sieve with p=2
  int pid = (j * threadsPerBlock) + threadIdx.x + 1;
  if (pid < maxp)
  {
   uint32 p = primes[pid];
   uint32 block_start = bid * block_size;
   uint32 start_offset;
 
   // compute the starting location of this prime in this block
   if ((bid == 0) || (p >= block_start))
    start_offset = p/2 + p - block_start;
   else
   {
    uint32 dist = (block_start - p/2);
    uint32 steps = dist / p;
    if ((dist % p) == 0)
     start_offset = (steps * p) + p/2 - block_start;
    else
     start_offset = (steps * p) + p + p/2 - block_start;
   }
   for(uint32 i=start_offset ;i < block_size; i+=p)
    locsieve[i] = 0;
  }
 }
 
 // regroup before counting
 __syncthreads();
 
 // each thread sums a range of the array
 for (j=threadIdx.x * range + 1, k=0; k<range-1; k++)
  locsieve[threadIdx.x * range] += locsieve[j+k];
 
 // regroup before reducing
 __syncthreads();
 
 // finally return the total count in global memory
 if (threadIdx.x == 0)
 {  
  k = 0;
  for (j=0; j<block_size; j+=range)
   k += locsieve[j];
  count[bid] = k;
 }
 
}
uint32 tiny_soe(uint32 limit, uint32 *primes)
{
 //simple sieve of Eratosthenes for small limits - not efficient
 //for large limits.
 uint8 *flags;
 uint32 prime;
 uint32 i,j;
 int it;
 //allocate flags
 flags = (uint8 *)malloc(limit/2 * sizeof(uint8));
 if (flags == NULL)
  printf("error allocating flags\n");
 memset(flags,1,limit/2);
 //find the sieving primes, don't bother with offsets, we'll need to find those
 //separately for each line in the main sieve.
 primes[0] = 2;
 it=1;
 
 //sieve using primes less than the sqrt of the desired limit
 //flags are created only for odd numbers (mod2)
 for (i=1;i<(uint32)(sqrt(limit)/2+1);i++)
 {
  if (flags[i] > 0)
  {
   prime = (uint32)(2*i + 1);
   for (j=i+prime;j<limit/2;j+=prime)
    flags[j]=0;
   primes[it]=prime;
   it++;
  }
 }
 //now find the rest of the prime flags and compute the sieving primes
 for (;i<limit/2;i++)
 {
  if (flags[i] == 1)
  {
   primes[it] = (uint32)(2*i + 1);
   it++;
  }
 }
 free(flags);
 return it;
}
bool InitCUDA(void)
{
  int count = 0;
  int i = 0;
 
  cudaGetDeviceCount(&count);
  if(count == 0) {
    fprintf(stderr, "There is no device.\n");
    return false;
  }
  for(i = 0; i < count; i++) {
    cudaDeviceProp prop;
    if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
      if(prop.major >= 1) {
        printf("Device %d supports CUDA %d.%d\n",i, prop.major, prop.minor);
        printf("It has warp size %d, %d regs per block, %d threads per block\n",prop.warpSize, prop.regsPerBlock, prop.maxThreadsPerBlock);
        printf("max Threads %d x %d x %d\n",prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        maxT = prop.maxThreadsDim[0];
        printf("max Grid %d x %d x %d\n",prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("total constant memory %d\n",prop.totalConstMem);
        break;
      }
    }
  }
  if(i == count) {
    fprintf(stderr, "There is no device supporting CUDA 1.x.\n");
    return false;
  }
  cudaSetDevice(i);
  return true;
}
int main(int argc, char** argv)
{
 uint64 N = 100000;
 uint32 Nsmall;
 int numblocks;
 int primes_per_thread;
 uint64 array_size;
 uint32* primes;
 uint32* device_primes;
 uint32* block_counts;
 uint32 np;
 unsigned int thandle; 
 uint32* block_counts_on_host;
 
 if (argc > 1)
  N = strtoull(argv[1],NULL,10);
 Nsmall = (uint32)sqrt(N);
 
 if(!InitCUDA())
 {
  return 0;
 } 
 primes = (uint32*)malloc(Nsmall*sizeof(uint32));
 np = tiny_soe(Nsmall, primes);
 printf("%d small primes (< %d) found\n", np, Nsmall);
 
 // put the primes on the device
 CUDA_SAFE_CALL(cudaMalloc((void**) &device_primes, sizeof(uint32) * np));
 CUDA_SAFE_CALL(cudaMemcpy(device_primes, primes, sizeof(uint32)*np, cudaMemcpyHostToDevice));
 cutCreateTimer(&thandle);
 cutStartTimer(thandle);
 array_size = (N / 2 / block_size + 1) * block_size;
 numblocks = array_size / block_size;
 primes_per_thread = (np + threadsPerBlock - 1) / threadsPerBlock;
 printf("using %d blocks and %d threads, with %d primes per thread\n",
  numblocks, threadsPerBlock, primes_per_thread);
 CUDA_SAFE_CALL(cudaMalloc((void**) &block_counts, sizeof(uint32) * numblocks));
 CUDA_SAFE_CALL(cudaMemset(block_counts, 0, sizeof(uint32) * numblocks));
 
 dim3 grid((uint32)sqrt(numblocks)+1,(uint32)sqrt(numblocks)+1);
 
 SegSieve<<<grid, threadsPerBlock, 0>>>(block_counts, device_primes, np, primes_per_thread, N);
 cudaThreadSynchronize();  
 
 cutStopTimer(thandle);
 printf("%f milliseconds for big sieve\n",cutGetTimerValue(thandle)); 
 block_counts_on_host = (uint32 *)malloc(numblocks * sizeof(uint32));
 cudaMemcpy(block_counts_on_host, block_counts, sizeof(uint32) * numblocks, cudaMemcpyDeviceToHost);
 
 cudaFree(device_primes);
 cudaFree(block_counts);
 
 uint32 nbig = 0;
 for (int i=0; i<numblocks; i++)
 {
  //printf("found %u primes in block %d\n", block_counts_on_host[i], i);
  nbig += block_counts_on_host[i];
 }
 
 printf("\n%u big primes (< %llu) found\n",nbig,N);
 free(block_counts_on_host);
 free(primes);
 
 return 0;
}
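
(To build this: cutil_inline.h and the CUDA_SAFE_CALL/cut* timer helpers come from the old CUDA SDK rather than the toolkit itself, so something along these lines is needed; the SDK paths are assumptions about a typical install:)

Code:
nvcc -arch=sm_20 -I$CUDA_SDK/C/common/inc -L$CUDA_SDK/C/lib \
     -lcutil_x86_64 segsieve.cu -o segsieve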

Last fiddled with by bsquared on 2012-06-26 at 18:56 Reason: removed cuPrintf stuff...
2012-06-27, 13:28   #21
bsquared

Some loop unrolling gave nearly a factor of 2 speedup:

Code:
__global__ static void SegSieve(uint32 *count, uint16 * primes, int maxp, int nump, uint32 N)
{ 
 uint32 i,j,k;
 uint32 bid = blockIdx.y * gridDim.x + blockIdx.x; 
 uint32 range = block_size / threadsPerBlock;
 __shared__ u8 locsieve[block_size];
 // everyone init the array.
 for (j=threadIdx.x * range, k=0; k<range; k++)
 {
  // we're counting hits in the kernel as well, so clear the bytes representing primes > N
  if ((2*(bid * block_size + j + k)+1) < N)
   locsieve[j+k] = 1;
  else
   locsieve[j+k] = 0;
 }
 
 // regroup before sieving
 __syncthreads();
 
 // now sieve the array
 for (j=0; j<nump; j++)
 {
  // + 1 because we don't sieve with p=2
  int pid = (j * threadsPerBlock) + threadIdx.x + 1;
  if (pid < maxp)
  {
   uint32 p = primes[pid];
   uint32 block_start = bid * block_size;
   uint32 start_offset;
 
   // compute the starting location of this prime in this block
   if ((bid == 0) || (p >= block_start))
    start_offset = p/2 + p - block_start;
   else
   {
    uint32 dist = (block_start - p/2);
    uint32 steps = dist / p;
    if ((dist % p) == 0)
     start_offset = (steps * p) + p/2 - block_start;
    else
     start_offset = (steps * p) + p + p/2 - block_start;
   }
    if (p < 1024)
    {
     // clear eight strided bytes per iteration: i, i+p, ..., i+7p
     uint32 stop = block_size - p * 8 + p;
     for(i=start_offset ;i < stop; i+=8*p)
    {
     locsieve[i] = 0;
     locsieve[i+p] = 0;
     locsieve[i+p*2] = 0;
     locsieve[i+p*2+p] = 0;
     locsieve[i+p*4] = 0;
     locsieve[i+p*4+p] = 0;
     locsieve[i+p*4+p*2] = 0;
     locsieve[i+p*7] = 0;
    }
   }
   else if (p < 2048)
   {
    uint32 stop = block_size - p * 4 + p;
    for(i=start_offset ;i < stop; i+=4*p)
    {
     locsieve[i] = 0;
     locsieve[i+p] = 0;
     locsieve[i+p*2] = 0;
     locsieve[i+p*2+p] = 0;
    }
   }
   else if (p < 4096)
   {
    uint32 stop = block_size - p * 2 + p;
    for(i=start_offset ;i < stop; i+=2*p)
    {
     locsieve[i] = 0;
     locsieve[i+p] = 0;
    }     
   }
   else
    i = start_offset;
 
   for( ;i < block_size; i+=p)
    locsieve[i] = 0;
  }
 }
 
 // regroup before counting
 __syncthreads();
 
 // each thread sums a range of the array
 for (j=threadIdx.x * range + 1, k=0; k<range-1; k++)
  locsieve[threadIdx.x * range] += locsieve[j+k];
 
 // regroup before reducing
 __syncthreads();
 
 // finally return the total count in global memory
 if (threadIdx.x == 0)
 {  
  k = 0;
  for (j=0; j<block_size; j+=range)
   k += locsieve[j];
  count[bid] = k;
 }
 
}
Now takes 150 milliseconds to count primes up to 2^30.
2012-06-27, 17:03   #22
axn

Have you tried looking at the generated assembly (not PTX, but the native ISA)? It might give you ideas as to where you can optimize.
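
For reference, the toolkit's cuobjdump can dump the native ISA (SASS), e.g.:

Code:
cuobjdump -sass segsieve > segsieve.sass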