mersenneforum.org  

2014-04-13, 00:11   #1
tapion64

CUDA Atomic Access

I want to write a simple kernel which calls a device function to perform a computation (evaluating a quadratic at an integer). Then we take the result of that operation and reduce it mod n (for our purposes, it's the same n in every calculation). Then each thread needs to access a global variable, multiply its value by the residue we found, and take the product mod n again.

The order does not matter so long as none of the operations get interrupted. Unfortunately, having every thread contend for one global variable is not going to yield good performance (and it also appears there's no atomic multiply).

How do I solve the multiplication problem, or is it just not possible? Is there a way to get around it? I feel like there should be some way to accomplish an atomic read-multiply-store.

On the atomic-blocking side, my idea was to give each block its own shared variable and do the multiplications with the block's threads on that shared variable; then, when all the blocks are done, merge the per-block values and take the final product mod n. Will this give better performance because fewer threads are blocked trying to write to the global? Is there a better implementation?
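A minimal sketch of the per-block idea (illustrative names, not working code from my project; it assumes blockDim.x is a power of two and n < 2^32 so a 64-bit intermediate product cannot overflow):

```cuda
// One partial modular product per block, via a shared-memory tree reduction.
__global__ void blockModProduct(const unsigned int *residues,
                                unsigned int *blockProducts,
                                unsigned int n, int count)
{
    extern __shared__ unsigned int s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread loads its residue; threads past the end load the identity 1.
    s[tid] = (i < count) ? residues[i] % n : 1u;
    __syncthreads();

    // Standard tree reduction, but with modular multiply instead of add.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = (unsigned int)
                (((unsigned long long)s[tid] * s[tid + stride]) % n);
        __syncthreads();
    }

    // Thread 0 writes this block's partial product to global memory.
    if (tid == 0)
        blockProducts[blockIdx.x] = s[0];
}
```

It would be launched with the shared-memory size passed explicitly, e.g. `blockModProduct<<<numBlocks, 256, 256 * sizeof(unsigned int)>>>(...)`, leaving one partial product per block to be merged afterwards.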
2014-04-13, 05:07   #2
jasonp

This is a standard example of a reduction operation, for which there's plenty of sample code available. Most reduction examples compute the sum of a batch of numbers in a single kernel, and for a modular-multiply accumulation within a single block the code is essentially identical.

However, while there is an atomic add instruction, CUDA has no atomic modular multiply, so there is no safe way to accumulate products across multiple blocks. So you have two choices:

- use CUDA code to simulate a mutex, or

- have each block accumulate its own product and store it in a global memory array, then have a single block of a separate kernel do the final accumulation over that array

The second option is much preferred.
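A minimal sketch of that second option (illustrative names; it assumes each block of the first kernel has already written its partial product into a global array, that blockDim.x is a power of two no larger than 256, and that n < 2^32 so 64-bit intermediate products cannot overflow):

```cuda
// Launched as a single block, e.g. finalModProduct<<<1, 256>>>(...), after
// the first kernel has filled blockProducts with one partial product per block.
__global__ void finalModProduct(const unsigned int *blockProducts,
                                unsigned int *result,
                                unsigned int n, int numBlocks)
{
    // Each thread serially folds a strided slice of the partial products...
    unsigned long long acc = 1;
    for (int i = threadIdx.x; i < numBlocks; i += blockDim.x)
        acc = (acc * blockProducts[i]) % n;

    // ...then a shared-memory tree reduction combines the per-thread values.
    __shared__ unsigned int s[256];
    s[threadIdx.x] = (unsigned int)acc;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] = (unsigned int)
                (((unsigned long long)s[threadIdx.x] * s[threadIdx.x + stride]) % n);
        __syncthreads();
    }

    if (threadIdx.x == 0)
        *result = s[0];
}
```

Because only one block runs, there is no cross-block race to worry about, which is exactly why this option avoids the need for a mutex.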
2014-04-13, 13:42   #3
tapion64

Thanks, I figured that was the case.