Understanding Parallel Computing: GPUs vs CPUs Explained Simply With the Role of CUDA


Introduction

In 1996, NVIDIA entered the 3D accelerator market initially behind the competition. However, through constant learning and improvement, they achieved major success in 1999 with the introduction of the GeForce 256, recognized as the first graphics card marketed as a GPU. Initially designed for gaming, GPUs later found a plethora of business applications in math, science, and engineering.

In 2003, Ian Buck and his team introduced Brook, the first widely embraced programming model that extended C with data-parallel constructs. Buck later played a key role at NVIDIA, leading the 2006 launch of CUDA, the first commercially available solution for general-purpose computing on GPUs.

CUDA serves as the connecting bridge between NVIDIA GPUs and GPU-based applications, enabling popular deep learning libraries like TensorFlow and PyTorch to leverage GPU acceleration. This capability is crucial for optimizing deep learning tasks and underscores the value of using GPUs in the field. Today, CUDA is widely considered essential for AI development, and is a standard software component of any AI development pipeline.

Prerequisites

  1. Basic Computer Architecture

    • Understand what CPUs and GPUs are and their primary functions.
    • Familiarity with cores, threads, and the general concept of computation.
  2. Introduction to Parallelism

    • Grasp the difference between serial and parallel processing.
    • Awareness of tasks that benefit from parallelism, such as matrix operations.
  3. Programming Fundamentals

    • Basic knowledge of programming languages like Python or C/C++.
    • Experience with loops, conditional statements, and functions.
  4. CUDA Overview

    • High-level understanding of CUDA as a model for parallel computing on NVIDIA GPUs.
    • Recognition of CUDA’s role in enabling developers to write programs that exploit GPU parallelism.

What is Parallel Computing?

In simpler terms, parallel computing is a way of solving a single problem by breaking it down into smaller chunks and solving each one simultaneously. Instead of having one powerful machine complete one complex process, parallel computing uses multiple computers or processors to work on different pieces of the problem at the same time. This approach speeds up the handling of large tasks and makes them far more efficient. It is similar to having a team of co-workers handling different assignments simultaneously in order to meet a shared goal. Together, the smaller workers produce a large increase in overall processing speed.
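To make this concrete, here is a minimal sketch (not part of the original article's benchmarks) that splits one large summation into chunks and hands each chunk to a separate worker process using Python's standard multiprocessing module; the chunk size and worker count are arbitrary values chosen for illustration.

from multiprocessing import Pool

def partial_sum(bounds):
    # Each worker sums one chunk of the full range
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == '__main__':
    n = 10_000_000
    chunk = 2_500_000
    # Break the single big problem into independent pieces
    pieces = [(i, min(i + chunk, n)) for i in range(0, n, chunk)]
    with Pool(processes=4) as pool:
        # Each piece is computed at the same time on a separate CPU core
        results = pool.map(partial_sum, pieces)
    print(sum(results) == sum(range(n)))  # True: same answer, computed in parallel

Each worker computes its piece at the same time on its own core, and the partial results are combined at the end — the same divide-and-combine pattern that GPUs apply at a much larger scale.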

CUDA in Simpler Terms

CUDA, or Compute Unified Device Architecture, created by NVIDIA, is a software platform for parallel computing. It has been applied to many business problems since its popularization in the mid-2000s in fields such as computer graphics, finance, data mining, machine learning, and scientific computing. CUDA enables accelerated computing through its specialized programming language, compatible with most operating systems.

GPU vs CPU

A CPU, or central processing unit, serves as the primary computational unit in a server or machine; it is known for handling diverse computing tasks for the operating system and applications. The CPU is responsible for executing mathematical and logical calculations in our computer. Its primary function is to run code, handling tasks such as copying files, deleting data, and processing user inputs. Moreover, the CPU acts as a mediator for communication between different device peripherals, ensuring they don't interact directly but go through the CPU.

While it may seem that the CPU can multitask, each core of the CPU can only handle one task at a time. Each core operates as an independent processing unit, and the ability to multitask is determined by the number of cores in the hardware. Generally, 2 to 8 cores per CPU is enough for the tasks a layman may need, and these CPUs are efficient enough that humans can't even notice that tasks are being executed in sequence rather than all at once. This is the case for almost everything we use CPUs for on a daily basis.
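If you are curious how many cores your own machine exposes, Python's standard library can report it (a quick check, not something the later benchmarks depend on):

import os

# Number of logical CPU cores the operating system reports
print("CPU cores available:", os.cpu_count())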

A graphics processing unit (GPU), in contrast, is a specialized hardware component that can efficiently handle parallel mathematical operations, surpassing the general-purpose capabilities of a CPU. Initially designed for graphics rendering in gaming and animation, GPUs have since evolved to perform a much broader range of tasks beyond their original scope. However, some of them remain dedicated hardware designed to handle specific tasks.

Let’s take a look at some raw numbers. While the most advanced consumer CPU systems generally come equipped with 16 cores, the most advanced consumer-grade GPU (NVIDIA RTX 4090) has 16,384 CUDA cores. This difference is only magnified when looking at the H100, which has 18,432 CUDA cores. Those CUDA cores are generally less powerful than individual CPU cores, so we cannot make direct comparisons. However, the sheer volume of CUDA cores should show why GPUs are comparatively ideal for handling large amounts of computation in parallel.

When comparing CPUs and GPUs, it might seem like a good idea to rely solely on GPUs because of their parallel processing capabilities. However, the demand for CPUs continues, because massive parallelism isn't always the most efficient approach. We also use CPUs for general computing that would be almost too simple for GPUs. In certain scenarios, executing tasks sequentially can be more time- and resource-effective than parallel processing. The advantage of CUDA lies in its ability to seamlessly switch between CPU and GPU processing for specific tasks. This flexibility allows programmers to strategically decide when to use which hardware component, providing greater control over the computer’s operations.

CUDA’s Role in the GPU

You can look at the CUDA version and GPU info by typing nvidia-smi into your terminal. In a notebook cell, we can do this by adding a ! at the start of the line.

!nvidia-smi

Once we have confirmed our machine has everything we need set up, we can import the Torch package. It also has a handy CUDA checker function we can use to ensure that Torch was properly installed and can detect CUDA and the GPU.

import torch

# Returns True if PyTorch can detect a CUDA-capable GPU
use_cuda = torch.cuda.is_available()

In this case it will return ‘True’

or,

if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print("using", device, "device")

With CUDA, programmers can design and implement parallel algorithms that take advantage of the thousands of cores present in modern GPUs. This parallelization is important for computationally intensive tasks such as scientific simulations, machine learning, video editing, and data processing. CUDA provides a programming model and a set of APIs that enable developers to write code that runs directly on the GPU, unlocking the potential for significant performance gains compared to traditional CPU-based computing. By offloading parallelizable workloads to the GPU, CUDA plays a key role in enhancing the computational capabilities of GPUs and driving advancements in high-performance computing applications.
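To give a flavor of that programming model without leaving Python, below is a minimal sketch of a CUDA kernel written with the Numba library. Numba is our choice for illustration only — it is not used elsewhere in this article and would require pip install numba plus a CUDA-capable GPU. Each GPU thread handles exactly one element of the arrays.

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    # Each GPU thread computes its own global index and processes one element
    i = cuda.grid(1)
    if i < out.size:
        out[i] = x[i] + y[i]

n = 1_000_000
x = np.ones(n, dtype=np.float32)
y = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
# Launch the kernel: many threads run this function in parallel on the GPU
add_kernel[blocks_per_grid, threads_per_block](x, y, out)
print(out[:5])  # [2. 2. 2. 2. 2.]

The launch configuration [blocks_per_grid, threads_per_block] is what maps the work onto thousands of GPU threads; the same addition on a CPU would run as a single sequential loop.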


Speed Test

Let us try to get some information about the CUDA version and the GPU:

if device:
    # Note: torch.backends.cudnn.version() reports the cuDNN build (e.g. 8302), not the CUDA toolkit version
    print('__CUDA VERSION:', torch.backends.cudnn.version())
    print('__Number CUDA Devices:', torch.cuda.device_count())
    print('__CUDA Device Name:', torch.cuda.get_device_name(0))
    print('__CUDA Device Total Memory [GB]:', torch.cuda.get_device_properties(0).total_memory / 1e9)

__CUDA VERSION: 8302
__Number CUDA Devices: 1
__CUDA Device Name: NVIDIA RTX A4000
__CUDA Device Total Memory [GB]: 16.89124864

We will conduct three speed tests to compare the performance of CPU versus GPU. Additionally, for the fourth test, we will generate a synthetic dataset using Stable Diffusion and measure the speed at which the A4000 GPU can complete the task.

For writing this demo, we chose to use an NVIDIA RTX A4000. The demo should work on any GPU or CPU machine.

Matrix Division

The Python code below performs matrix division on both the CPU and the GPU, and it measures the time the operation takes on each device.

The code creates random matrices, performs the operation on the CPU, transfers the matrices to the GPU, and then measures the time taken for the same operation on the GPU. The loop repeats this process 5 times for more accurate timing results on the GPU. The torch.cuda.synchronize() call ensures that any pending GPU work, such as the transfers, is complete before the timing starts.

import time

matrix_size = 43 * 15

x = torch.randn(matrix_size, matrix_size)
y = torch.randn(matrix_size, matrix_size)

print("######## CPU SPEED ##########")
start = time.time()
result = torch.div(x, y)
print(time.time() - start)
print("verify device:", result.device)

# Move the matrices to the GPU and repeat the timing five times
x_gpu = x.to(device)
y_gpu = y.to(device)
torch.cuda.synchronize()

for i in range(5):
    print("######## GPU SPEED ##########")
    start = time.time()
    result_gpu = torch.div(x_gpu, y_gpu)
    print(time.time() - start)
    print("verify device:", result_gpu.device)

[Output: matrix division timings on CPU vs GPU]

As we can see, the computations were significantly faster on the GPU than on the CPU.

Build an Artificial Neural Network

The Python code below builds a simple neural network model on both the CPU and the GPU to demonstrate a basic speed test.

import tensorflow as tf
import time

data_size = 10000
input_data = tf.random.normal([data_size, data_size])

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation='relu', input_shape=(data_size,)),
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

def speed_test(device):
    # Run one epoch of training on the requested device and time it
    with tf.device(device):
        start_time = time.time()
        model.fit(input_data, tf.zeros(data_size), epochs=1, batch_size=32, verbose=0)
        end_time = time.time()
        return end_time - start_time

cpu_time = speed_test('/CPU:0')
print("Time taken on CPU: {:.2f} seconds".format(cpu_time))

gpu_time = speed_test('/GPU:0')
print("Time taken on GPU: {:.2f} seconds".format(gpu_time))

[Output: ANN training time on CPU vs GPU]

Build a Convolutional Neural Network (CNN)

The code below trains a Convolutional Neural Network (CNN) on the MNIST dataset using TensorFlow. The speed_test function measures the time taken for training on both CPU and GPU, allowing us to compare their performance.

import tensorflow as tf
from tensorflow.keras import layers, models
import time

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

def speed_test(device):
    # Train for five epochs on the requested device and time it
    with tf.device(device):
        start_time = time.time()
        model.fit(train_images, train_labels, epochs=5, batch_size=64,
                  validation_data=(test_images, test_labels), verbose=0)
        end_time = time.time()
        return end_time - start_time

cpu_time = speed_test('/CPU:0')
print("Time taken on CPU: {:.2f} seconds".format(cpu_time))

gpu_time = speed_test('/GPU:0')
print("Time taken on GPU: {:.2f} seconds".format(gpu_time))

[Output: CNN training time on CPU vs GPU]

Create a Synthetic Emotions Dataset with Stable Diffusion

Next, let us try creating a synthetic dataset with Stable Diffusion by generating 10 images for each of several emotions such as angry, sad, lonely, and happy. Follow the steps below to recreate the dataset.

Please note that the code below will require a GPU.

First, we need to install the necessary libraries.

!pip install --upgrade diffusers transformers scipy
!pip install --quiet ipyplot

Please make sure to restart the kernel once the libraries above are installed, or this may not work.

Import the necessary packages, and specify the model id of the pre-trained model.

We will assign the string "cuda" to the variable device. This indicates that the code intends to use a CUDA-enabled GPU for computation.

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline
import ipyplot
import random
import os
import time
import matplotlib.pyplot as plt

model_id = "CompVis/stable-diffusion-v1-5"
device = "cuda"

Create an instance of the StableDiffusionPipeline class by loading the pre-trained model specified in the variable model_id. The from_pretrained method is commonly used in deep learning frameworks to instantiate a model and load pre-trained weights if available.

pipe = StableDiffusionPipeline.from_pretrained(model_id)
pipe = pipe.to(device)

Create the specific folders to store the images:

os.makedirs('/notebooks/happy', exist_ok=True)
os.makedirs('/notebooks/sad', exist_ok=True)
os.makedirs('/notebooks/angry', exist_ok=True)
os.makedirs('/notebooks/surprised', exist_ok=True)
os.makedirs('/notebooks/lonely', exist_ok=True)

The next lines of code generate images with the StableDiffusionPipeline for different emotions and genders. They do so in a loop, creating 10 images for each emotion.

genders = ['male', 'female']

emotion_prompts = {'happy': 'smiling',
                   'surprised': 'surprised, opened mouth, raised eyebrows',
                   'sad': 'frowning, sad face expression, crying',
                   'angry': 'angry, fierce, irritated',
                   'lonely': 'lonely, alone, lonesome'}

print("######## GPU SPEED ##########")
start = time.time()

for j in range(10):
    for emotion in emotion_prompts.keys():
        emotion_prompt = emotion_prompts[emotion]
        gender = random.choice(genders)
        prompt = 'Medium-shot image of {}, {}, front view, looking at the camera, color photography, '.format(gender, emotion_prompt) + \
                 'photorealistic, hyperrealistic, realistic, incredibly detailed, crisp focus, digital art, depth of field, 50mm, 8k'
        negative_prompt = '3d, cartoon, anime, sketches, (worst quality:2), (low quality:2), (normal quality:2), lowres, normal quality, ((monochrome)), ' + \
                          '((grayscale)) Low Quality, Worst Quality, plastic, fake, disfigured, deformed, blurry, bad anatomy, blurred, watermark, grainy, signature'
        # Generate one image per (emotion, random gender) pair and save it in the matching folder
        image = pipe(prompt=prompt, negative_prompt=negative_prompt).images[0]
        image.save('/notebooks/{}/{}.png'.format(emotion, str(j).zfill(4)))

print(time.time() - start)

Now, let’s run the code and look at how long the task takes at A4000 GPU speeds, and then make a slight change to compare it with the CPU speeds.

[Output: Stable Diffusion generation time on the GPU]

Then, to put our pipeline onto the CPU, simply run the following snippet before running the same code:

pipe.to('cpu')

This will get us our CPU times, shown below.

[Output: Stable Diffusion generation time on the CPU]

As we can see, the CPU was significantly slower. This is because images are represented by computers as arrays of numbers, and performing that multitude of operations in parallel on a GPU is simply much more efficient.
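To see what "arrays of numbers" means in practice, the short sketch below loads one of the images generated above and inspects its array form. The file path assumes the folder layout created earlier, and the exact shape depends on the pipeline's output resolution.

import numpy as np
from PIL import Image

img = Image.open('/notebooks/happy/0000.png')
arr = np.asarray(img)
# A typical Stable Diffusion v1 output is (512, 512, 3): height x width x RGB channels
print(arr.shape, arr.dtype)
# Hundreds of thousands of values, each of which the GPU can process in parallel
print(arr.size, "values")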

Results

Here is an overview of all of our analyses from this blog post. GPUs were consistently faster across each of these data manipulation and machine learning tasks.

Speed Test

Task                GPU time (seconds)   CPU time (seconds)
Matrix Division     5.8e-05 (avg)        0.00846
ANN                 2.78                 23.30
CNN                 48.31                167.68
Stable Diffusion    121.03               3153.04

Conclusion

The pairing of CUDA with NVIDIA GPUs holds a dominant position in various technology domains, particularly in the field of deep learning. This combination serves as a cornerstone for powering some of the world’s supercomputers.

CUDA and NVIDIA GPUs have successfully powered industries such as deep learning, data science and analytics, gaming, finance, research, and many more. Deep learning, for instance, relies heavily on accelerated computing, particularly GPUs and specialized hardware like TPUs.

The use of GPUs significantly accelerates the training process, reducing it from months to a week. Various deep learning frameworks, including TensorFlow, PyTorch, and others, depend on CUDA for GPU support and cuDNN for deep neural network computations. Performance gains are shared across frameworks when these underlying technologies improve, but differences in scalability to multiple GPUs and nodes exist among frameworks.
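As a quick sanity check of that dependency chain, both frameworks expose calls that confirm whether they can see the GPU and its libraries (a minimal sketch; the exact output depends on your installation):

import torch
import tensorflow as tf

print("PyTorch sees CUDA: ", torch.cuda.is_available())
print("cuDNN available:   ", torch.backends.cudnn.is_available())
print("TensorFlow GPUs:   ", tf.config.list_physical_devices('GPU'))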

In summary, when picking a GPU for deep learning or any AI task, one of the things to keep in mind is that the GPU should support CUDA.

We hope you enjoyed reading the article.

References

  • Code reference for generating synthetic data
  • Your GPU Compute Capability
  • What is CUDA?