mersenneforum.org  

Old 2017-04-19, 03:07   #2597
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13×373 Posts
Default fftlength fallback

Quote:
Originally Posted by kriesel
On a Quadro 2000, testing the limits of fft length crashes the program without completing any benchmarks, as far as I can tell. This is repeatable on the only Quadro 2000 I've run it on so far.

$ cudalucas2.05.1-cuda4.2-windows-x64 -cufftbench 1 65536 5 >>bigfftbench.txt
CUDALucas.cu(1055) : cudaSafeCall() Runtime API error 2: out of memory.

On a GeForce GTX480 (at least on mine), it will run many fft lengths and output timings to stdout, then terminate before reaching 65536 or producing the fft lengths file. Scaling the maximum back to the highest length reached on stdout then produces a file.

The Quadro 2000 has 1 GB of VRAM; the GTX480 has 1.5 GB.
If the program cannot allocate sufficient card memory, have it fall back to the next smaller fft length, state that it is doing so, and repeat until sufficient memory can be allocated. On the GTX480, the maximum fft length that can be benchmarked is somewhat a function of the 2.05.1 CUDA version; on the Quadro 2000 it's fairly constant. Running the same executables on the two cards, the GTX480 outputs many fft length timings to stdout before hitting the limit or finishing, while the Quadro 2000 outputs no fft length timings before halting with the out-of-memory error, if started with a reasonable lower fft length and a too-large upper fft length. The GTX480 will do at least up to 58320; the Quadro 2000 up to 38880. (Both do about 38880 per GB.)
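
To illustrate the kind of fallback requested, here is a minimal sketch, not actual CUDALucas code: the fft length table and the bytes-per-K memory cost are made-up placeholders, and only the CUDA runtime calls are real. The idea is simply to probe whether the allocation for a length succeeds, and step down to the next smaller length when it does not.

Code:
 /* Hedged sketch of an allocation-probe fallback; not CUDALucas source. */
 #include <stdio.h>
 #include <cuda_runtime.h>

 /* hypothetical: candidate fft lengths in K, largest first */
 static const int fft_lengths_k[] = { 65536, 58320, 46656, 38880, 32768, 16384 };

 /* bytes_per_k is a hypothetical per-1K memory cost for the buffers needed */
 int pick_fft_length(size_t bytes_per_k)
 {
     for (int i = 0; i < (int)(sizeof fft_lengths_k / sizeof fft_lengths_k[0]); i++) {
         size_t need = (size_t)fft_lengths_k[i] * bytes_per_k;
         void *p = NULL;
         if (cudaMalloc(&p, need) == cudaSuccess) {
             cudaFree(p);              /* probe only; real buffers get allocated later */
             return fft_lengths_k[i];  /* largest length that fits */
         }
         cudaGetLastError();           /* clear the out-of-memory error and fall back */
         printf("fft length %dK (%zu bytes) would not allocate; trying next smaller length\n",
                fft_lengths_k[i], need);
     }
     return -1;                        /* nothing fits */
 }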

Last fiddled with by kriesel on 2017-04-19 at 03:15 Reason: add empirical data
Old 2017-04-29, 16:19   #2598
TheJudger
 
 
"Oliver"
Mar 2005
Germany

10001010110₂ Posts
Default

OK, got my hands on another set of P100s, this time the higher-clocked Tesla P100-SXM2-16GB. Compared to the Tesla P100-PCIE-16GB, these babies have a 300W TDP (instead of 250W) and somewhat higher base and boost clock rates. Memory bandwidth is exactly the same... and so is CUDALucas performance! So it seems CUDALucas is completely memory-bandwidth bound on the P100!

Oliver
Old 2017-05-08, 00:49   #2599
storm5510
Random Account
 
 
Aug 2009
U.S.A.

2×3×7×43 Posts
Default

2.06 Beta runs very well on my hardware. The only difference I see is the round-off error values are a bit higher. With 2.05, they stayed in the 0.05 range. With the Beta, they are running 0.62 to 0.75. I don't know if this is worth mentioning, but I guess it won't hurt.

Thanks!
Old 2017-05-08, 04:16   #2600
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13·373 Posts
Default 2.06 beta

Quote:
Originally Posted by storm5510
2.06 Beta runs very well on my hardware. The only difference I see is the round-off error values are a bit higher. With 2.05, they stayed in the 0.05 range. With the Beta, they are running 0.62 to 0.75. I don't know if this is worth mentioning, but I guess it won't hurt.

Thanks!
Hmm. What hardware? In my experience, on multiple card types, a round-off error >0.50 causes the program to terminate. I'm also seeing and documenting multiple other issues and notifying flashjh as they're found. How many other beta testers are out there?
Old 2017-05-08, 23:45   #2601
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

12F1₁₆ Posts
Default

Quote:
Originally Posted by kriesel
Hmm. What hardware? In my experience, on multiple card types, a round-off error >0.50 causes the program to terminate. I'm also seeing and documenting multiple other issues and notifying flashjh as they're found. How many other beta testers are out there?
Correction: CUDAPm1 terminates if the round-off error exceeds 0.40; CUDALucas changes fft length and resumes from the last checkpoint if the round-off error exceeds 0.35.
Old 2017-05-09, 04:11   #2602
storm5510
Random Account
 
 
Aug 2009
U.S.A.

2·3·7·43 Posts
Default

Quote:
Originally Posted by kriesel
Hmm. What hardware? In my experience, on multiple card types, a round-off error >0.50 causes the program to terminate. I'm also seeing and documenting multiple other issues and notifying flashjh as they're found. How many other beta testers are out there?
HP workstation with an i5-3570 running Windows 10 Pro x64, and an NVIDIA GTX 750 Ti running CUDA 8.

Edit: I just noticed a serious error in my post about the Beta. The round-off error values should be 0.062 to 0.075. I fudged the decimal point. Sorry!

Last fiddled with by storm5510 on 2017-05-09 at 04:20 Reason: Correction
Old 2017-06-03, 17:53   #2603
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13·373 Posts
Default Warning: Device numbers can change! Request features to mitigate

Feature request: device confirmation

There is a chain of events that can lead to multiple instances of CUDALucas (or a mix with CUDAPm1, mfaktc, etc.), launched with different -d numbers, unintentionally running on the same physical GPU at the same time, with considerable slowdown from sharing one device plus, perhaps, multiplexing overhead. This can also lead to other unintended effects, depending on the various GPU cards' speeds, memory capacities, and reliability differences, and on the types of runs. I have observed this sequence repeatedly on V2.05.1.

I believe it could affect all currently available versions that support multiple device numbers on multiple-GPU systems. I have not detected incorrect Lucas-Lehmer test results produced by this unintentional GPU sharing or switching, but I haven't ruled that out either. I believe it is very likely to produce incorrect fft or threads benchmark results.

Please consider adding device confirmations, such as gpu model, Bus ID, or bios version, as command line options. This would both help identify when the events occur, so that they can be addressed, and help prevent or reduce the negative impact when the event chain occurs.

More detail follows.

Part one, status quo
--------------------

Example system, Condorette: 3 gpus, all running CUDALucas, mixed versions, Windows 7 64 bit & current on updates, NVIDIA driver 378.66
gtx1050Ti v2.06beta -d 0,
Quadro 2000 v2.05.1 -d 1,
Quadro 2000 v2.05.1 -d 2;
Logged Quadro 2000 timings went to roughly double duration (half speed) when two instances were running on one Quadro 2000, days after the last system change: from 35-37 ms/iter to ~76 ms/iter on the same exponents, persisting for weeks until system shutdown.
BIOS versions and bus numbers of the cards as reported by GPU-Z instances differ, as follows:
A -d 0 GTX 1050 Ti 86.07.22.00.50 Bus 40 Device 0 furthest from CPUs physically
B -d 1 q2000 1 70.06.0D.00.02 Bus 15 Device 0 middle card
C -d 2 q2000 2 70.06.31.02.02 Bus 28 Device 0 closest to CPUs physically

Example without the options requested (in separate directories, run inside batch files):
a) CUDALucas2.06beta-CUDA6.5-Windows-win32.exe -d 0 >>cl.txt
b) CUDALucas2.05.1-CUDA6.5-Windows-win32.exe -d 1 >>cl.txt
c) CUDALucas2.05.1-CUDA6.5-Windows-win32.exe -d 2 >>cl.txt

At startup all cards are cool and the above example is fine. But if/when card B overheats and shuts down in self-defense, a driver timeout is detected and Windows attempts to restart it. If the GPU is still too hot, since it's being heated by the other cards nearby, it fails to restart. (In my experience, the GPU generally does not restart until one or both neighbors' workloads are halted so things cool down, and the system is restarted. Rarely, if I recall correctly, the GPU is restartable hours later.)

Now the a and c instances are running, and their GPU-Z instances can still read their respective sensors, but the sensor data and actual clock rate data for B are no longer available. The b instance of CUDALucas detects an issue and terminates. The batch file it was launched from launches a new executable with option -d 1 again; since GPU B is now invisible, -d 1 now apparently refers to GPU C as far as Windows is concerned. Meanwhile instance c of CUDALucas is still running successfully on GPU C, so the b and c instances of CUDALucas timeshare GPU C, with some performance loss. If instance c is later stopped and relaunched, it will fail to find a device 2, print the usual error message, and halt, leaving instance b the full use of GPU C.

I have confirmed this chain of events by observing logged iteration timing changes after manual instance halts and restarts. Changes in the clock rates shown by the GPU-Z instances corresponding to each physical GPU also confirm it. In the case of a Quadro 2000, shutdown occurs after GPU-Z displays a GPU temperature of 98 C. Stated temperature limits vary by GPU model: the GTX480 limit is 105 C, the GTX1070 94 C, the GTX 1050 Ti 97 C. (I've found Quadro 2000 & 4000 temperature limits are not stated in NVIDIA spec sheets.)

If the highest device number card overheats and drops out, device remapping does not occur. If any other card does, multiple cards can remap to lower device numbers; for example, in a 4-GPU system, if -d 0 drops out, all the others move down by one for new launches.

The effect of a card dropping out varies. In one system I have a high-reliability GPU as -d 0 running CUDALucas, and a GPU of low default reliability (it needs to be clocked slower than its default rates) as -d 1 doing P-1. If the high-reliability one dropped out, and the iffy GPU had restarted at a higher clock rate that produces errors, a restart of CUDALucas would put the continuation of a lengthy LL test on the iffy GPU and cause bad residues, which I want to avoid. (Tools like MSI Afterburner or EVGA Precision XOC can adjust the clock rates, but I've observed the rates don't stay set, eventually resetting to the higher value that re-enables memory errors. I have not yet tried modifying and flashing the GPU BIOS to cap the rates lower.)

If remapping puts a run on a different model GPU with less memory, it might fail to execute at the larger fft lengths. Remapping could also cause real havoc during benchmarking runs: timings from the wrong model GPU, some fft benchmark timings double what they should be due to accidentally sharing a GPU for part of a run, or a run failing to complete because of resource limits such as memory size differing between GPU models.

Part two, proposed feature additions
------------------------------------

Example with the requested options added & possible syntax:
CUDALucas2.06beta-CUDA6.5-Windows-win32.exe -d 0 -m "GTX 1050 Ti" -p "Bus 40 Device 0" -b 86.07.22.00.50 -warn >>cl.txt
CUDALucas2.06beta-CUDA6.5-Windows-win32.exe -d 1 -m "Quadro 2000" -p "Bus 15 Device 0" -b 70.06.0D.00.02 -halt >>cl.txt
CUDALucas2.06beta-CUDA6.5-Windows-win32.exe -d 2 -m "Quadro 2000" -p "Bus 28 Device 0" -b 70.06.31.02.02 >>cl.txt

Here,
-m is the GPU model identifier (as CUDALucas logs it in -info output; these may not be unique within a system, and some are not on mine)
-p is the PCI bus identifier (I think these are unique within one system)
-b is the GPU BIOS version identifier (I haven't seen a case where these match within a system, but it might happen)

Ideally, -m -p and -b could be present in any of the permutations (none, any one, any two, or all three; in any order).
Options -warn or -halt could be restricted to appear only when at least one of -b, -m, or -p is also specified.

Note that BIOS versions can be very different or very similar. Examples I've seen include:
86.07.22.00.50 all 5 fields differ; 8 of 10 digits differ
70.06.31.02.01
70.06.31.02.02 right field differs by one

In this case, the executable queries the hardware selected as -d whatever for matches to the specific expected parameters given for model, bus ID, and/or BIOS version (preferably unique within the system). It prints to stdout the confirmation options specified and the responses obtained. If any of them mismatch, it warns and goes ahead computing anyway if -warn is specified; it halts if -halt is specified. If neither -warn nor -halt is specified, it warns of a mismatch on stdout and proceeds to execute (i.e., -warn is the default). The current behavior is equivalent to don't detect, don't warn. I suppose an option -nowarn could also be included, but I don't see much utility in that.
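
To sketch what such a check might look like (not actual CUDALucas code; the function and parameter names are hypothetical, and since the CUDA runtime API does not expose the GPU BIOS version, a real -b check would need NVML), the model and bus ID can be compared against cudaGetDeviceProperties() output:

Code:
 /* Hedged sketch of the proposed -d/-m/-p confirmation; not CUDALucas source. */
 #include <stdio.h>
 #include <string.h>
 #include <stdlib.h>
 #include <cuda_runtime.h>

 /* returns 0 if device d matches the expected model and PCI bus, 1 otherwise */
 int confirm_device(int d, const char *expected_model, int expected_bus, int halt_on_mismatch)
 {
     struct cudaDeviceProp prop;
     if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) {
         fprintf(stderr, "device %d: cudaGetDeviceProperties failed\n", d);
         return 1;
     }
     int mismatch = 0;
     if (expected_model && strcmp(prop.name, expected_model) != 0) {
         printf("device %d: model is \"%s\", expected \"%s\"\n", d, prop.name, expected_model);
         mismatch = 1;
     }
     if (expected_bus >= 0 && prop.pciBusID != expected_bus) {
         printf("device %d: PCI bus %d, expected %d\n", d, prop.pciBusID, expected_bus);
         mismatch = 1;
     }
     if (mismatch && halt_on_mismatch)
         exit(1);      /* -halt behavior */
     return mismatch;  /* with -warn (the default), the caller logs and continues */
 }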

(end)
Old 2017-06-05, 17:28   #2604
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13·373 Posts
Default recovery from illegal residues

Ugh. Overnight, a CUDALucas V2.05.1 run went from fine to pointless, just over halfway through. This was not the system or GPU that was overheating previously.

| Jun 04 23:44:45 | M80409907 42500000 0x1b919bf0d5dc5667 | 4320K 0.29297 37.1880 743.76s | 19:19:45:29 52.85% |
| Jun 04 23:57:09 | M80409907 42520000 0x26de355425603463 | 4320K 0.28125 37.2359 744.71s | 19:19:25:02 52.87% |
| Jun 05 00:09:28 | M80409907 42540000 0x0000000000000002 | 4320K 0.28125 36.9652 739.30s | 19:19:04:25 52.90% |
| Jun 05 00:20:54 | M80409907 42560000 0x0000000000000002 | 4320K 0.23906 34.3121 686.24s | 19:18:42:01 52.92% |

... until manual detection ~10am.

About three GPU-weeks lost. System backups had also failed. I now have the savefiles ini parameter set to one, have switched the executable to 2.06beta, and relaunched.

V2.05.1 would continue to compute meaningless 0x02 residues at full fft length.
The April 18 2.06beta computes 0x02 residues until the next checkpoint, then terminates after detecting the illegal residue. The batch file wrapper relaunches it, and the cycle repeats.
The May 5 2.06beta build does the same as the April 18 build.
So I conclude that an illegal residue and the bad c and t files produced by v2.05.1 are detected, but not recovered from, by v2.06beta through the April 18 and May 5 builds.
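
For illustration, the detection side is simple; a sketch follows (not actual CUDALucas code, and the exact set of residues 2.06beta treats as illegal is my assumption; 0, 2, and the low 64 bits of Mp-2 are values an LL iteration collapses to after an arithmetic breakdown). Recovery could then walk back through the s files, newest first, and resume from the newest one whose residue passes the check, rather than reloading the bad c and t files.

Code:
 /* Hedged sketch of illegal-residue detection; not CUDALucas source. */
 #include <stdio.h>
 #include <stdint.h>

 int residue_is_illegal(uint64_t r)
 {
     return r == 0x0000000000000000ULL ||   /* collapsed to 0 */
            r == 0x0000000000000002ULL ||   /* stuck at 2, as in the log above */
            r == 0xfffffffffffffffdULL;     /* low 64 bits of Mp - 2 */
 }

 int main(void)
 {
     printf("%d\n", residue_is_illegal(0x0000000000000002ULL));  /* prints 1 */
     return 0;
 }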

I manually renamed the c and t files to prevent restarting from them. After a bit of running, the savefiles directory looks like:
Directory of C:\Users\Ken\My Documents\cl-quadro2000-2\savefiles

06/05/2017 11:03 AM .
06/05/2017 11:03 AM ..
06/05/2017 10:38 AM 0 .empty.txt
06/05/2017 11:59 AM 10,051,280 s80409907.100000.fda890b7e00cf3cd.cls
06/05/2017 11:03 AM 10,051,280 s80409907.14136.73bcfcd608d670c6.cls
06/05/2017 10:38 AM 10,051,280 s80409907.43616986.0000000000000002.cls
06/05/2017 10:52 AM 10,051,280 s80409907.43618538.0000000000000002.cls
5 File(s) 40,205,120 bytes

It looks like the naming convention for savefiles is
s[exponent].[iteration].[hex 64-bit residue].cls.
What's the procedure for actually using a valid savefile if needed?
Halt CUDALucas, delete or rename or move c and t files, move or copy an s file and rename it to be a c or t file, relaunch CUDALucas?
The readme file covers creation of savefiles but seems to be silent about use of savefiles.

(end)
Old 2017-06-23, 21:47   #2605
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13×373 Posts
Default ETA oddities

If a CUDALucas Lucas-Lehmer test in progress is moved from a slow GPU to a faster GPU, the total required time apparently is not recalculated for the second GPU, so the ETA does not match what a calculation with the iteration speed of the faster GPU would indicate. Excerpts from an example run follow, with some comments interspersed.

Continuing M80443463 @ iteration 39520001 with fft length 4320K, 49.13% done

| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Jun 22 11:39:57 | M80443463 39540000 0xc5ddef5105433496 | 4320K 0.34375 13.6289 272.56s | 24:22:45:04 49.15% |
| Jun 22 11:44:30 | M80443463 39560000 0xb1e6ea9c33c1a3fe | 4320K 0.32813 13.6285 272.57s | 24:22:14:02 49.17% |
| Jun 22 11:49:03 | M80443463 39580000 0x43b2046313d7a7b3 | 4320K 0.31250 13.6283 272.56s | 24:21:43:02 49.20% |
| Jun 22 11:53:36 | M80443463 39600000 0x7fc26f721831079a | 4320K 0.29688 13.6281 272.56s | 24:21:12:04 49.22% |
| Jun 22 11:58:08 | M80443463 39620000 0xd6bab5c2754eac22 | 4320K 0.30859 13.6589 273.17s | 24:20:41:08 49.25% |
| Jun 22 12:02:41 | M80443463 39640000 0x66aca7edf49b4250 | 4320K 0.31250 13.6275 272.55s | 24:20:10:13 49.27% |
| Jun 22 12:07:13 | M80443463 39660000 0xe2578942679761ce | 4320K 0.30884 13.6281 272.56s | 24:19:39:19 49.30% |
| Jun 22 12:11:46 | M80443463 39680000 0x751514eb98ae8498 | 4320K 0.29688 13.6285 272.57s | 24:19:08:28 49.32% |
| Jun 22 12:16:19 | M80443463 39700000 0x3af1e6f8d1d5b626 | 4320K 0.31250 13.6287 272.57s | 24:18:37:37 49.35% |
| Jun 22 12:20:52 | M80443463 39720000 0xddb86eb30a41f069 | 4320K 0.30078 13.6584 273.16s | 24:18:06:49 49.37% |
| Jun 22 12:25:24 | M80443463 39740000 0xb0f8e032c204279b | 4320K 0.32031 13.6276 272.55s | 24:17:36:02 49.40% |

0.0136276 seconds times 80443463 times (1-0.494) = about 6.42 days, not 24.71 days.
The ETA is dropping about 30.8 minutes in about 5.5 minutes, a ratio of about 5.57.
This example occurred moving an exponent from a Quadro 2000 to a GTX 1050 Ti.
Version is CUDALucas v2.06beta 32-bit Windows build, compiled May 5 2017 @ 12:33:52

Hours later, after fft size changes, it adjusts ETA:

| Jun 22 17:55:21 | M80443463 41180000 0xa62927d07ad536a8 | 4320K 0.31250 13.6277 272.55s | 23:05:46:37 51.19% |
Round off error at iteration = 41191900, err = 0.35938 > 0.35, fft = 4320K.
Restarting from last checkpoint to see if the error is repeatable.

Using threads: square 32, splice 256.

Continuing M80443463 @ iteration 41180001 with fft length 4320K, 51.19% done

Round off error at iteration = 41191900, err = 0.35938 > 0.35, fft = 4320K.
The error persists.
Trying a larger fft until the next checkpoint.

Using threads: square 32, splice 256.

Continuing M80443463 @ iteration 41180001 with fft length 4608K, 51.19% done


| Jun 22 18:05:07 | M80443463 41200000 0x0cb6bac9222e41be | 4608K 0.07031 13.0346 260.67s | 5:22:05:24 51.21% |
Resettng fft.

Using threads: square 32, splice 256.

Continuing M80443463 @ iteration 41200001 with fft length 4320K, 51.22% done

| Jun 22 18:09:40 | M80443463 41220000 0xc5e12eeadc77d748 | 4320K 0.31641 13.6281 272.54s | 6:04:29:02 51.24% |

Perhaps expected duration is calculated only when beginning an exponent or changing fft size. How about doing it at checkpoint file save intervals?
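
A minimal sketch of the kind of per-checkpoint recalculation I have in mind follows; the names are hypothetical and this is not the actual CUDALucas internals, just remaining iterations times the current per-iteration time:

Code:
 /* Hedged sketch of recomputing ETA at each checkpoint; not CUDALucas source. */
 #include <stdio.h>

 void print_eta(unsigned int exponent, unsigned int iteration, double ms_per_iter)
 {
     /* an LL test needs about (exponent - 2) iterations */
     double remaining_iters = (double)(exponent - 2) - (double)iteration;
     double remaining_s = remaining_iters * ms_per_iter / 1000.0;

     int days  = (int)(remaining_s / 86400.0);
     int hours = (int)(remaining_s / 3600.0) % 24;
     int mins  = (int)(remaining_s / 60.0) % 60;
     int secs  = (int)remaining_s % 60;

     printf("ETA %d:%02d:%02d:%02d (%.2f%% done)\n",
            days, hours, mins, secs,
            100.0 * iteration / (double)(exponent - 2));
 }

 int main(void)
 {
     /* the checkpoint above: 13.6277 ms/iter on the GTX 1050 Ti gives about 6.2 days, not 23+ */
     print_eta(80443463, 41180000, 13.6277);
     return 0;
 }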
Old 2017-06-24, 20:27   #2606
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4849₁₀ Posts
Default CUDALucas feature suggestion: run time estimation for entire worktodo file

Prime95 estimates completion date/time for multiple entries in its worktodo file. It's very handy for scheduling.
CUDALucas computes an estimated time of completion (ETA) only for the current exponent, while it is in progress.

It would be useful to have an option in CUDALucas to read the entire worktodo file and, at the outset of a run (following the usual header and optional info section), compute and display the time span and completion date for each line of the worktodo file.
Option could be -w or -work.
The output might look something like the following (for a GTX 1050 Ti or similar speed card):

Code:
 Work Queue Status
Start-date start-time exponent  current-iteration run-length iteration-time total-time-est completion-estimate %-complete 
  Jun 22    18:05:07  M80443463      41200000        4608K      13.0346       12d 3:13:40    Jun 28  16:10:31    51.21%  
  Jun 28    16:10:31  M43161917             0        2304K       6.0141        3d 0:06:20    Jul 01  16:16:51     0.00%  
  (any additional exponents queued would follow in list)

 Total of  2 exponents queued, occupying estimated 15d 3:20:00 total, 8d 22:11:44 remaining.
Draft pseudocode (without having looked at the existing source code):

Code:
 Get current date and time
Set start date and time for first work as current date and time
Zero exponent count, estimated run time total, estimated remaining time total
open worktodo file for read
Output header for work table
While (!EOF) {
  read a line of worktodo
  parse to obtain exponent
  increment count of valid lines containing work
  if there is a checkpoint file for the exponent in the working directory to resume from{
    read it to obtain fft length and saved iteration number (or error handling if read fails or values obtained are not valid)
  } else {
    determine fft length for the exponent
    assume zero iterations performed
  }
  perhaps, if it is the first valid work line, save exponent for resumption or start after work estimation
  look up the iteration time for the fft length
  compute duration in seconds as (exponent-2) iterations times iteration-time divided by 1000
  compute percent-done as iterations done / (exponent-2) * 100
  compute remaining duration for this exponent as duration * (1 - percent-done/100)
  compute completion date and time as start date and time plus remaining duration
  output line
  set start date and time (for next worktodo line) as completion date and time for this line's exponent
  add this exponent's estimated time to total estimated run time
  add this exponent's estimated remaining duration to total remaining time estimate
}
close worktodo
output summary line with count of exponents, total estimated run time, and total estimated remaining time.
(continue; resume or start first exponent in worktodo, if there is at least one valid work line)
One could consider also re-estimating the worktodo each time after completing an exponent.

Comments?

(end)

Last fiddled with by kriesel on 2017-06-24 at 20:40 Reason: distinguish full run time versus remaining run time totals
Old 2017-07-11, 03:12   #2607
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4849₁₀ Posts
Default

Quote:
Originally Posted by kriesel
If the program cannot allocate sufficient card memory, have it fall back to the next smaller fft length, state that it is doing so, and repeat until sufficient memory can be allocated. On the GTX480, the maximum fft length that can be benchmarked is somewhat a function of the 2.05.1 CUDA version; on the Quadro 2000 it's fairly constant. Running the same executables on the two cards, the GTX480 outputs many fft length timings to stdout before hitting the limit or finishing, while the Quadro 2000 outputs no fft length timings before halting with the out-of-memory error, if started with a reasonable lower fft length and a too-large upper fft length. The GTX480 will do at least up to 58320; the Quadro 2000 up to 38880. (Both do about 38880 per GB.)
The above applies to the extreme case of requesting an fftbench run from 1K up to some maximum. I subsequently found that higher values could be benchmarked in separate runs: make a run 1-38880, then 38880 to a higher value, and so on, then manually combine the separate runs into one fft file. This was done to explore the limits of the software and hardware. The higher fft lengths have little practical value on slower GPUs, since a full LL run would take several years, perhaps longer than the remaining reliable life of a GPU card. A line in an fft file like
16384 294471259 132.2527
corresponds to more than a year, and
32768 580225813 282.8326
to an LL test taking up to 5.2 years on a Quadro 2000,
38880 685923253 396.8878
up to 8.6 years.
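
For reference, these figures are just iterations times per-iteration time, assuming the second field of an fft file line is the largest exponent supported at that fft length and the third field is milliseconds per iteration:

Code:
 \text{LL run time} \approx p \cdot t_{\text{iter}} = 685923253 \times 396.8878\ \text{ms} \approx 2.72\times10^{8}\ \text{s} \approx 8.6\ \text{years}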

FFT or threads benchmarking to wide ranges of fft lengths has revealed some serious anomalies on certain GPU and CUDA-level combinations.

Last fiddled with by kriesel on 2017-07-11 at 03:14
