CUDA samples download

We welcome your input on issues and suggestions for samples. At this time we are not accepting contributions from the public; check back here as we evolve our contribution model.

In the 16-bit half-precision floating-point format, one bit is used for the sign, five bits for the exponent, and ten bits for the mantissa.
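The half-precision layout mentioned above (one sign bit, five exponent bits, ten mantissa bits) can be made concrete with a small host-side decoder. This is only an illustrative sketch in plain C++, assuming the IEEE 754 binary16 conventions (exponent bias 15, subnormals, infinities and NaN); the function name `halfBitsToFloat` is hypothetical and is not part of any CUDA sample.

```cpp
#include <cmath>
#include <cstdint>

// Decode a 16-bit half-precision bit pattern into a float.
// Assumed layout: bit 15 = sign, bits 14..10 = exponent (bias 15),
// bits 9..0 = mantissa.
float halfBitsToFloat(std::uint16_t bits) {
    const int sign     = (bits >> 15) & 0x1;
    const int exponent = (bits >> 10) & 0x1F;
    const int mantissa = bits & 0x3FF;
    const float s = sign ? -1.0f : 1.0f;

    if (exponent == 0) {
        // Zero or subnormal: no implicit leading 1, value = mantissa * 2^-24.
        return s * std::ldexp(static_cast<float>(mantissa), -24);
    }
    if (exponent == 0x1F) {
        // All-ones exponent encodes infinity (mantissa 0) or NaN.
        return mantissa == 0 ? s * INFINITY : NAN;
    }
    // Normal number: (1024 + mantissa) / 1024 * 2^(exponent - 15).
    return s * std::ldexp(static_cast<float>(1024 + mantissa), exponent - 25);
}
```

For instance, the bit pattern `0x3C00` has a zero sign bit, an exponent field of 15 and a zero mantissa, so it decodes to 1.0.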

This sample illustrates the usage of CUDA streams to achieve overlapping of kernel execution with data copies to and from the device.

This application demonstrates how to use the new CUDA 4. This sample demonstrates the basic usage of the CUDA occupancy calculator and occupancy-based launch configurator APIs by launching a kernel with the launch configurator, and measures the utilization difference against a manually configured launch.

This sample demonstrates a CUDA 5. This example demonstrates how to pass in a GPU device function from the GPU device static library as a function pointer to be called. This sample requires devices with compute capability 2. This sample uses a new CUDA 4. This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated shared memory arrays.

This sample uses the new CUDA 4. This sample illustrates how to use Zero MemCopy: kernels can read and write directly to pinned system memory. A trivial template project that can be used as a starting point to create new CUDA projects. It is the same as the sample illustrating Chapter 3 of the programming guide with some additions like error checking.

The vector addition kernel demonstrated is the same as the sample illustrating Chapter 3 of the programming guide. This Vector Addition sample is a basic sample that is implemented element by element. This sample also uses the new CUDA 4. This sample replaces the device allocation in the vectorAddDrv sample with cuMemMap-ed allocations. This sample demonstrates that the cuMemMap API allows the user to specify the physical properties of their memory while retaining the contiguous nature of their access, thus not requiring a change in their program structure.

This test application is capable of measuring device to device copy bandwidth, host to device copy bandwidth for pageable and page-locked memory, and device to host copy bandwidth for pageable and page-locked memory. This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel. Texture-based implementation of a separable 2D convolution with a Gaussian kernel.

The texture-based version is used for performance comparison against convolutionSeparable. This sample demonstrates how to build and use an intercept library with CUDA. This sample demonstrates how the Discrete Cosine Transform (DCT) for blocks of 8 by 8 pixels can be performed using CUDA: a naive implementation by definition and a more traditional approach used in many libraries.
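To illustrate what "a naive implementation by definition" of the 8 by 8 DCT means, here is a CPU-side sketch in plain C++. It evaluates the 2D DCT-II formula directly for one block, assuming orthonormal scaling; the names `Block8x8` and `dct8x8Naive` are hypothetical, and the CUDA sample itself may organize and scale the computation differently.

```cpp
#include <array>
#include <cmath>

constexpr int kBlock = 8;
using Block8x8 = std::array<std::array<double, kBlock>, kBlock>;

// Naive 2D DCT-II of one 8x8 block, evaluated straight from the definition
// with orthonormal scaling. Each of the 64 outputs sums over all 64 inputs.
Block8x8 dct8x8Naive(const Block8x8& in) {
    const double pi = std::acos(-1.0);
    Block8x8 out{};
    for (int u = 0; u < kBlock; ++u) {
        for (int v = 0; v < kBlock; ++v) {
            double sum = 0.0;
            for (int x = 0; x < kBlock; ++x) {
                for (int y = 0; y < kBlock; ++y) {
                    sum += in[x][y]
                         * std::cos((2 * x + 1) * u * pi / (2.0 * kBlock))
                         * std::cos((2 * y + 1) * v * pi / (2.0 * kBlock));
                }
            }
            // Normalization factors: 1/sqrt(2) for the zero-frequency index.
            const double cu = (u == 0) ? std::sqrt(0.5) : 1.0;
            const double cv = (v == 0) ? std::sqrt(0.5) : 1.0;
            out[u][v] = 0.25 * cu * cv * sum;
        }
    }
    return out;
}
```

With this scaling, a block in which every pixel has the same value c produces a single nonzero coefficient, `out[0][0] = 8 * c`, which is a convenient sanity check.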

The computation of all or a subset of all eigenvalues is an important problem in Linear Algebra, statistics, physics, and many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA. This sample illustrates how to use function pointers and implements the Sobel Edge Detection filter for 8-bit monochrome images.
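The bisection idea behind the tridiagonal eigenvalue sample can be sketched on the CPU as follows. This is a sequential illustration under stated assumptions, not the sample's parallel algorithm: it counts eigenvalues below a trial value with a Sturm sequence and then bisects inside the Gershgorin bounds. The function names, the tolerance, and the zero-pivot guard are all hypothetical choices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Number of eigenvalues smaller than x for the symmetric tridiagonal matrix
// with diagonal d and off-diagonal e (e.size() == d.size() - 1), taken from
// the signs of the Sturm sequence.
int countEigenvaluesBelow(const std::vector<double>& d,
                          const std::vector<double>& e, double x) {
    int count = 0;
    double q = 1.0;
    for (std::size_t i = 0; i < d.size(); ++i) {
        const double off = (i == 0) ? 0.0 : e[i - 1];
        q = d[i] - x - off * off / q;
        if (q == 0.0) q = -1e-300;  // assumed guard against division by zero
        if (q < 0.0) ++count;
    }
    return count;
}

// k-th smallest eigenvalue (k = 0 .. n-1), located by bisection.
double kthEigenvalue(const std::vector<double>& d,
                     const std::vector<double>& e, int k, double tol = 1e-12) {
    // Gershgorin circles give an interval containing every eigenvalue.
    double lo = d[0], hi = d[0];
    for (std::size_t i = 0; i < d.size(); ++i) {
        const double left  = (i > 0) ? std::fabs(e[i - 1]) : 0.0;
        const double right = (i + 1 < d.size()) ? std::fabs(e[i]) : 0.0;
        lo = std::min(lo, d[i] - left - right);
        hi = std::max(hi, d[i] + left + right);
    }
    // Halve the interval, keeping the half that contains eigenvalue k.
    for (int iter = 0; iter < 200 && hi - lo > tol; ++iter) {
        const double mid = 0.5 * (lo + hi);
        if (countEigenvaluesBelow(d, e, mid) > k) hi = mid; else lo = mid;
    }
    return 0.5 * (lo + hi);
}
```

Because each eigenvalue index can be bisected independently of the others, the search is a natural fit for a parallel implementation, which is presumably what makes it attractive on a GPU.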

This sample demonstrates two adaptive image denoising techniques: KNN and NLM, based on computation of both geometric and color distance between texels.

While both techniques are implemented in the DirectX SDK using shaders, a much faster variation of the latter technique, which takes advantage of shared memory, is implemented in addition to the DirectX counterparts.

A simple test application that demonstrates a new CUDA 4. Interval arithmetic operators example. The recursive mode requires Compute SM 2. This sample uses CUDA to simulate and visualize a large set of particles and their physical interaction.

This example implements a uniform grid data structure using either atomic operations or a fast radix sort from the Thrust library. This sample demonstrates a very fast and efficient parallel radix sort that uses the Thrust library. The included RadixSort class can sort either key-value pairs with float or unsigned integer keys or keys only. A parallel sum reduction that computes the sum of a large array of values.
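The access pattern of a parallel sum reduction can be previewed with a sequential C++ sketch. In each pass below, every addition is independent of the others, which is what would let a GPU assign one thread per addition; the function name `treeReduceSum` is hypothetical and the code is a CPU stand-in, not the sample's kernel.

```cpp
#include <cstddef>
#include <vector>

// Pairwise (tree) sum. Each pass adds element i + half into element i and
// halves the number of active elements, so n values need about log2(n) passes.
double treeReduceSum(std::vector<double> data) {
    if (data.empty()) return 0.0;
    std::size_t active = data.size();
    while (active > 1) {
        const std::size_t half = (active + 1) / 2;  // round up for odd counts
        for (std::size_t i = 0; i + half < active; ++i) {
            data[i] += data[i + half];  // independent for every i in this pass
        }
        active = half;
    }
    return data[0];
}
```

The vector is taken by value so the caller's data is left untouched while the partial sums are accumulated in place.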

This sample demonstrates single pass reduction using Multi Block Cooperative Groups. This sample requires devices with compute capability 6. This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
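The scan described above, where each output element is the sum of all input elements before it, is an exclusive prefix sum. The sketch below shows one common work-efficient formulation (an up-sweep followed by a down-sweep over a zero-padded power-of-two array), written sequentially in C++. It is an assumption that this resembles the sample's approach; the name `exclusiveScan` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Exclusive prefix sum using up-sweep / down-sweep passes.
// The inner loops of each pass touch disjoint elements, so a GPU could run
// every iteration of an inner loop in parallel.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::size_t n = 1;
    while (n < in.size()) n *= 2;           // pad to a power of two
    std::vector<int> a(n, 0);
    std::copy(in.begin(), in.end(), a.begin());

    // Up-sweep: build partial sums in a balanced tree.
    for (std::size_t d = 1; d < n; d *= 2) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) a[i] += a[i - d];
    }

    // Down-sweep: clear the root, then push prefix sums back down the tree.
    a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d /= 2) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            const int left = a[i - d];
            a[i - d] = a[i];
            a[i] += left;
        }
    }

    a.resize(in.size());                    // drop the padding
    return a;
}
```

For the input 1, 2, 3, 4 this produces 0, 1, 3, 6: the first output is always zero because no elements precede the first input.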

This sample demonstrates an approach to constructing image segmentation trees. This method is based on Boruvka's MST algorithm. This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. While generally less efficient for large sequences compared to algorithms with better asymptotic algorithmic complexity i. Refer to an excellent tutorial by H.
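A sorting network such as bitonic sort performs a fixed sequence of compare-and-swap steps that does not depend on the data, which is why it maps well onto parallel hardware. The following C++ sketch shows the classic iterative bitonic network on the CPU; it assumes the input length is a power of two, and the name `bitonicSort` is hypothetical rather than taken from the sample.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// In-place bitonic sort (ascending). Assumes a.size() is a power of two.
// For fixed k and j, every compare-and-swap in the innermost loop works on a
// distinct pair, so all of them could execute in parallel.
void bitonicSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t k = 2; k <= n; k *= 2) {          // size of bitonic runs
        for (std::size_t j = k / 2; j > 0; j /= 2) {   // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    // Blocks alternate between ascending and descending order.
                    const bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending) {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}
```

The network always performs the same comparisons regardless of the values, so its cost is fixed at roughly n log²(n) compare-and-swap operations.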

This sample shows how to perform a reduction operation on an array of values using the __threadfence() intrinsic to produce a single value in a single kernel, as opposed to two or more kernel calls as shown in the "reduction" CUDA Sample. Single-pass reduction requires global atomic instructions Compute Capability 2. CUDA contexts can be created separately and attached independently to different threads.

This sample also uses the async copy provided by the CUDA pipeline interface for asynchronous loads from global memory to shared memory, which improves kernel performance and reduces register pressure. This sample is a simple code that illustrates binary partition cooperative groups and reduce within the thread block. This sample requires devices with compute capability 3.

In addition to that, it demonstrates the use of the new CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize, which allows the application to reserve more shared memory than is available by default.

Further, this sample also demonstrates how to use the cooperative groups async copy interface over a group to perform asynchronous loads from global memory to shared memory. This sample implements matrix multiplication which uses asynchronous copy of data from global to shared memory when on compute capability 8.

It also demonstrates the arrive-wait barrier for synchronization. This sample demonstrates how graph memory nodes re-use virtual addresses and physical memory. For CUDA 5. This sample demonstrates how to use Cooperative Groups (CG) to perform warp-aggregated atomics to single and multiple counters, a useful technique to improve performance when many threads atomically add to a single or multiple counters.

This function expects a single channel 8-bit grayscale input image. The Canny Edge Detection function combines and improves on the techniques required to produce an edge detection image using multiple steps. This sample implements a conjugate gradient solver on multiple GPUs using Multi Device Cooperative Groups, also uses Unified Memory optimized using prefetching and usage hints. Currently only supported on Ubuntu This sample demonstrates how any border version of an NPP filtering function can be used in the most common mode, with border control enabled.

These functions can be used to duplicate the results of the equivalent non-border version of the NPP functions. They can also be used for enabling and disabling border control on various source image edges depending on what portion of the source image is being used as input. This sample is an implementation of a simple line-of-sight algorithm: given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point.
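One way to picture the line-of-sight computation is as a running maximum over elevation slopes: a point is visible when no closer point rises above the sight line to it. The C++ sketch below does this sequentially for heights sampled along a single ray. The function name, the uniform `step` spacing, and the choice to treat ties as visible are assumptions made for illustration; a parallel version would replace the running maximum with a max-scan.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// heights[i] is the terrain height at distance (i + 1) * step from the
// observer along the ray. Returns, for each sample, whether it is visible.
std::vector<bool> lineOfSight(double observerHeight,
                              const std::vector<double>& heights,
                              double step = 1.0) {
    std::vector<bool> visible(heights.size());
    double maxSlope = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < heights.size(); ++i) {
        // Slope of the sight line from the observer to this sample.
        const double slope = (heights[i] - observerHeight) / ((i + 1) * step);
        // Visible if no closer sample blocks the view (ties count as visible).
        visible[i] = slope >= maxSlope;
        if (slope > maxSlope) maxSlope = slope;
    }
    return visible;
}
```

For an observer at height 0 and heights 1, 1, 4, 2, the slopes are 1, 0.5, about 1.33 and 0.5, so the first and third samples are visible and the second and fourth are hidden.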

The line-of-sight implementation is based on the Thrust library. This sample implements matrix multiplication from Chapter 3 of the programming guide. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming both into the frequency domain, multiplying them together, and transforming the signal back to the time domain.

In this example, CUFFT is used to compute the 2D-convolution of some signal with some filter by transforming both into the frequency domain, multiplying them together, and transforming the signal back to the time domain on multiple GPUs. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming both into the frequency domain, multiplying them together, and transforming the signal back to the time domain on multiple GPUs.
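The transform, multiply, inverse-transform recipe used by the CUFFT convolution examples can be tried out on the CPU with a small C++ sketch. A naive O(N²) DFT stands in for the FFT library here, so this is not how CUFFT works internally; it only shows the three steps for the 1D case. The names `dft` and `convolveViaDft` are hypothetical, the filter is assumed to be zero-padded to the signal length, and the result is a circular convolution.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Signal = std::vector<std::complex<double>>;

// Naive DFT. sign = -1 gives the forward transform, +1 the unscaled inverse.
Signal dft(const Signal& x, int sign) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    Signal out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = sign * 2.0 * pi * static_cast<double>(k * t)
                               / static_cast<double>(n);
            sum += x[t] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Circular convolution: transform both inputs, multiply pointwise in the
// frequency domain, transform back, and scale by 1/N.
Signal convolveViaDft(const Signal& signal, const Signal& filter) {
    const std::size_t n = signal.size();  // filter assumed to have length n
    const Signal s = dft(signal, -1);
    const Signal f = dft(filter, -1);
    Signal product(n);
    for (std::size_t i = 0; i < n; ++i) product[i] = s[i] * f[i];
    Signal result = dft(product, +1);
    for (auto& value : result) value /= static_cast<double>(n);
    return result;
}
```

As a check, convolving the signal 1, 2, 3, 0 with the filter 1, 1, 0, 0 adds each sample to its predecessor and yields 1, 3, 5, 3 up to rounding error.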

It can be used in image recovery and denoising. Each pixel is weighted by considering both the spatial distance and the color distance between it and its neighbors. Tomasi, R.

To use the makefiles, change the current directory to the sample directory you wish to build, and run make.

See the Linux Installation Guide for a list of supported host compilers. The Mac samples are built using makefiles. To use the makefiles, change directory into the sample directory you wish to build, and run make.

See the Mac Installation Guide for a list of supported host compilers. Some CUDA samples require third-party dependencies, which are listed below. If a sample has a third-party dependency that is available on the system, but is not installed, the sample will waive itself at build time.

If available, these dependencies are either installed on your system automatically, or are installable via your system's package manager (Linux) or a third-party website. FreeImage is an open source imaging library. FreeImage can usually be installed on Linux using your distribution's package manager system. FreeImage can also be downloaded from the FreeImage website. An MPI compiler can be installed using your Linux distribution's package manager system.

It is also available on some online resources, such as Open MPI. DirectX is a collection of APIs designed to allow development of multimedia applications on Microsoft platforms.


