What to Run on Cloud GPUs: A Practical Guide to LLMs, Diffusion, and Vector Databases
Type: Blog
Categories: Cloud Infrastructure, Kubernetes
Posted on 18 October 2025
Cloud GPUs can feel like cheat codes in the LLM race, until the bill lands or the latency SLO slips. Not every AI workload deserves a GPU, and even “GPU-worthy” jobs can waste money if you pick the wrong tier, precision, or batching strategy. Without a framework, you’re always at risk of overprovisioning expensive hardware for jobs that don’t need it, or underpowering critical inference services that demand low latency and high throughput.
This guide cuts through the noise with a workload-first playbook. We’ll map out when GPUs truly deliver value, and when they don’t, using clear heuristics like VRAM fit, p95 latency targets, and throughput scaling. From large language model (LLM) inference and fine-tuning, to diffusion-based image and video generation, to GPU-accelerated vector databases, you’ll see concrete formulas and sizing examples tailored for UpCloud’s GPU offerings.
If you’re trying to decide what to run on cloud GPUs, how to size the hardware correctly, and how to keep costs predictable, this guide gives you the clarity and formulas you need to move forward with confidence.
On UpCloud, GPU instances are delivered via dedicated PCIe passthrough, combining predictable performance with the same high-speed networking and MaxIOPS storage stack trusted across our cloud platform.
GPUs are not magic; they’re highly specialized hardware designed for a narrow but powerful set of tasks. To know when they help, you need to understand the break-even point between CPUs and GPUs. Four factors matter most: compute parallelism, memory bandwidth, latency targets, and energy efficiency.
That said, not every workload deserves a GPU. There are several common don’t-use-GPU cases:
To separate hype from reality, here are a few simple litmus tests you can use:
On UpCloud, GPUs are provisioned through full PCIe passthrough. You get isolated, bare-metal-level access with no noisy-neighbor effects, which makes them well suited to latency-sensitive inference workloads. The trade-off is that you don’t get an NVLink-style interconnect between multiple GPUs. Multi-GPU setups are possible, but they require careful sharding or distributed inference frameworks; you can’t assume automatic cross-GPU memory pooling.
Combine these GPU servers with UpCloud’s private networking, which keeps inference and data transfer inside your secure environment, and with MaxIOPS block storage for high-speed access to model checkpoints or LoRA weights.
Not all GPU-backed workloads are created equal. Some are latency-bound and demand tightly optimized inference, others are throughput-heavy and benefit from batch parallelism, and still others only need GPUs intermittently for index building or fine-tuning. To make sense of it, let’s break down the three most common GPU-hungry categories: LLMs, diffusion/image/video generation, and vector databases.
Large language models are the poster child for cloud GPUs, but not all workloads look alike. Serving a chat assistant demands fast, low-latency responses, while fine-tuning a 70B parameter model is a whole different challenge.
Chat vs. Batch Inference
For chat-style apps like support bots, the metric to watch is p95 latency. Users typically expect responses within ~200–300 ms per token, and some applications target sub-200 ms. GPUs enable this by pinning the KV cache, the running memory of the conversation, into VRAM. If the cache spills to CPU RAM, latency jumps dramatically.
Batch inference (e.g., summarizing hundreds of docs) shifts the focus to throughput. Here, GPUs shine again through micro-batching, though CPUs may suffice if runtimes aren’t urgent.
Precision and Quantization
Model size is governed by precision: FP32 is wasteful, FP16/bfloat16 is today’s sweet spot, and INT8/FP8 halve memory again with minimal quality loss. 8-bit and 4-bit quantization methods (such as GPTQ, or the 4-bit scheme used by QLoRA) shrink models 4–8× relative to FP32, often enough to fit on consumer-grade GPUs.
A 13B model in FP16 needs ~26 GB VRAM. Including the KV cache and overhead, this can increase to approximately 35 GB. Using INT8 quantization halves the model weights VRAM to around 13 GB, and 4-bit quantization reduces it further to roughly 7 GB, both before adding KV cache and additional overhead. A rule of thumb for VRAM is (bytes per parameter × parameter count) + ~25% overhead.
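The rule of thumb above is easy to check in a few lines. This is a minimal sketch, assuming decimal gigabytes and a flat 25% overhead; the function name and the precision table are illustrative, and the figures cover weights only, before any KV cache.

```python
# Bytes per parameter for common precisions (4-bit stored as half a byte).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billions: float, precision: str,
                   overhead: float = 0.25) -> tuple[float, float]:
    """Return (weights_gb, weights_plus_overhead_gb) in decimal GB.

    One billion parameters at one byte each is one GB, so the
    multiplication needs no unit conversion.
    """
    weights_gb = params_billions * BYTES_PER_PARAM[precision]
    return weights_gb, weights_gb * (1.0 + overhead)


for precision in ("fp16", "int8", "int4"):
    weights, total = weight_vram_gb(13, precision)
    print(f"13B @ {precision}: weights ~{weights:.1f} GB, "
          f"with overhead ~{total:.1f} GB")
```

Treat the result as a first filter: long contexts or large batches add KV-cache memory on top of these numbers.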
Fine-Tuning Approaches
Full fine-tuning updates every weight and often requires 8–16 high-memory GPUs for a 70B model. LoRA makes training more accessible by only updating small adapter layers, touching <1% of parameters. QLoRA takes this further by combining LoRA with quantization, letting you fine-tune very large models on a single GPU.
Don’t overlook checkpointing costs, though. Checkpointing during fine-tuning involves saving the model’s weights and training state periodically to storage. This consumes significant disk space, especially for very large models, and can slow down training due to the time taken to write checkpoints. Balancing checkpoint frequency is important to avoid excessive storage use and runtime overhead.
If LLMs test GPUs with sequential token generation and memory juggling, diffusion and generative media workloads push them with sheer compute demand and sudden VRAM spikes. Text-to-image, text-to-video, and even audio generation are some of the most GPU-intensive jobs you can run in the cloud.
Latency vs. Batch Size
Diffusion models like Stable Diffusion and SDXL generate images by iteratively denoising a latent representation. For latency-sensitive applications such as design tools or consumer-facing platforms, the goal is under 2 seconds per image at the 95th percentile. This requires smaller batch sizes, optimized kernels, and sometimes quantization.
Bulk jobs like dataset creation prioritize throughput, where larger batches enable GPUs to denoise many samples in parallel efficiently. For example, on a 24 GB GPU, generating a 512×512 SDXL image typically takes between 1.5 and 2.5 seconds, depending on batch size and GPU model; increasing resolution to 1024×1024 raises latency to approximately 4 to 6 seconds per image.
VRAM Spikes in Video Generation
Video workloads amplify these resource demands. Each frame requires its own diffusion pass, and ensuring smooth motion commonly means holding multiple frames in GPU memory simultaneously. For reference, generating a 3-second clip at 24 frames per second with 512×512 resolution can increase VRAM usage by 30–40% compared to processing a single image frame.
In contrast, a 10-second 1024×1024 clip may require GPUs with 48 GB or more of VRAM, such as the NVIDIA RTX A6000 or H100, or strategies like splitting frames across multiple GPUs to manage memory load. Video generation workloads often call for multi-GPU setups that distribute frames across cards and combine the results in post-processing.
Guidance and Optimization Knobs
Guidance scale (how closely the model follows a prompt) improves prompt adherence, but classifier-free guidance adds an extra unconditional pass to each step, and repeatedly sweeping the scale is a common source of wasted GPU hours. Other levers include quantization (FP16 as standard, INT8/FP8 cutting memory by ~30%), memory-efficient kernels like FlashAttention, and distributed inference frameworks such as DeepSpeed-Inference to spread work across GPUs.
Practical Sizing
Interactive apps should budget 24–48 GB VRAM per GPU, bulk image generation benefits from larger batches, and video workloads should assume spikes that call for 80 GB GPUs or multi-GPU orchestration. Latency-critical APIs need aggressive kernel optimizations, careful batching, and tuned guidance scales.
Vector databases are the quieter third pillar of GPU workloads. While LLMs and diffusion models grab the spotlight, retrieval (powering similarity search across billions of embeddings) often benefits just as much from GPU acceleration.
Why GPUs Matter
Vector search boils down to calculating distances like cosine or dot products across high-dimensional vectors. At scale, each query can mean millions of multiply-add operations, which GPUs handle far more efficiently than CPUs thanks to thousands of cores and massive memory bandwidth. GPUs also speed up index construction: building structures like HNSW or IVF-PQ that might take hours or days on CPUs can finish in minutes on GPUs, critical for teams that retrain embeddings often.
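To make the "distances across high-dimensional vectors" point concrete, here is a small brute-force sketch in pure Python. It is illustrative only: the function names and the toy corpus are made up for this example, and real systems use libraries such as FAISS or Milvus with GPU kernels instead of Python loops.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Score every vector against the query and return the k best indices."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Toy corpus of 3-dimensional embeddings (real embeddings have hundreds of dims).
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(top_k([1.0, 0.05, 0.0], corpus))  # indices of the two closest vectors
```

Every query touches every vector and every dimension, which is exactly the kind of uniform multiply-add work that parallel hardware handles well.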
Hybrid CPU–GPU Models
Not all workloads need GPUs full-time. A common strategy is hybrid execution: CPUs handle query routing, metadata filters, or cold queries, while GPUs process hot-path similarity search, high-QPS traffic, or large index builds. For example, a semantic search API may run mostly on CPUs, with the busiest 10% of queries routed to a GPU-backed FAISS or Milvus cluster for speed.
Sizing by Dimensions and Dataset Size
Two factors dominate GPU needs: the number of vectors in the dataset and the embedding dimensionality.
Practical Sizing Tips
Theory is useful, but when you’re on the hook for picking instance sizes, you need quick rules that map models and workloads to GPU tiers. Here are formulas and heuristics that cut through guesswork for LLMs, diffusion models, and vector databases.
The first question: does it fit in memory?
Here’s a formula you can use to estimate it:
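VRAM_needed ≈ (parameters × bytes_per_param)
+ (num_layers × context_length × hidden_dim × 2 × bytes_per_param)
+ overhead(20–30%)
Here, parameters is the model’s parameter count; bytes_per_param follows from the precision (4 for FP32, 2 for FP16/bfloat16, 1 for INT8/FP8, 0.5 for 4-bit); num_layers, context_length, and hidden_dim size the KV cache for one sequence, with the factor of 2 covering keys and values; and overhead covers activations, runtime buffers, and fragmentation.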
Here are a few example runs:
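One such run can be sketched in Python. This is a back-of-envelope estimate, not a measurement: the layer count (40) and hidden size (5120) are assumed values for a 13B-class model, the KV cache is counted per layer because keys and values are stored for every layer, and the 25% overhead is applied to weights plus cache.

```python
def estimate_vram_gb(params, bytes_per_param, context_length, hidden_dim,
                     num_layers, batch_size=1, overhead=0.25):
    """Rough VRAM estimate in decimal GB for transformer inference.

    Assumes the KV cache uses the same bytes per element as the weights;
    with quantized weights the cache is often kept in FP16 instead.
    """
    weights = params * bytes_per_param
    # Keys and values (factor of 2) for every layer and every cached token.
    kv_cache = (2 * num_layers * context_length * hidden_dim
                * bytes_per_param * batch_size)
    return (weights + kv_cache) * (1 + overhead) / 1e9


# Assumed shape for a 13B-class model: 40 layers, hidden size 5120.
fp16_4k = estimate_vram_gb(13e9, 2, 4096, 5120, 40)
fp16_8k = estimate_vram_gb(13e9, 2, 8192, 5120, 40)
print(f"13B FP16, 4k context: ~{fp16_4k:.1f} GB")
print(f"13B FP16, 8k context: ~{fp16_8k:.1f} GB")
```

Models that use grouped-query attention keep a smaller KV cache than this sketch assumes, so treat the cache term as an upper bound for those architectures.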
Here’s a formula to map throughput to user concurrency:
QPS ≈ (tokens_per_second × batch_size) / avg_tokens_per_request
If a 13B model yields 100 tokens/sec on a single GPU, and the average response is 200 tokens:
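Working that example through in code, as a rough sketch: the function name is illustrative, and the calculation assumes the 100 tokens/sec figure is per sequence and holds steady as the batch grows, which is optimistic on real hardware.

```python
def estimate_qps(tokens_per_second, batch_size, avg_tokens_per_request):
    """Requests per second a single GPU can sustain under the formula above."""
    return tokens_per_second * batch_size / avg_tokens_per_request


for batch_size in (1, 4, 8):
    qps = estimate_qps(100, batch_size, 200)
    print(f"batch_size={batch_size}: ~{qps:.1f} requests/sec")
```

Without batching, a single GPU serves about half a request per second in this scenario, which is why batch size is the first lever to check before adding GPUs.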
Some tips:
Break your latency into three buckets:
Rule of thumb:
For image/video generation, think in frames × steps:
Latency_per_output ≈ (num_steps × cost_per_step) / GPU_parallelism
VRAM_spike ≈ base_model + (frames × latent_dim × precision_bytes)
Examples:
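As one illustration of the latency formula, the sketch below varies the step count while holding the per-step cost fixed. The 50 ms per step is a hypothetical figure chosen for the example, not a benchmark of any particular GPU.

```python
def latency_per_output_ms(num_steps, cost_per_step_ms, gpu_parallelism=1):
    """Latency for one output in milliseconds, per the formula above."""
    return num_steps * cost_per_step_ms / gpu_parallelism


# Hypothetical per-step cost of 50 ms on a single GPU.
for steps in (50, 30, 20):
    seconds = latency_per_output_ms(steps, 50) / 1000
    print(f"{steps} steps -> ~{seconds:.1f} s per image")
```

Under these assumptions, getting below a 2-second target is mostly a question of step count, so the scheduler choice is worth tuning before paying for a larger GPU.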
There are two main dimensions to keep in mind here: dataset size and embedding dimensionality.
Memory ≈ (num_vectors × dimensions × bytes_per_dim)
Here’s a formula to estimate per-query latency for a brute-force scan:
Latency ≈ (num_vectors / GPU_parallelism) × distance_cost
(where distance_cost ≈ d multiply-adds per vector).
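The memory formula above can be applied directly. This sketch counts raw vectors only, in decimal GB; index structures such as HNSW graphs add memory on top, while compression such as IVF-PQ reduces it. The function name is illustrative.

```python
def index_memory_gb(num_vectors, dimensions, bytes_per_dim):
    """Raw embedding storage in decimal GB, before any index overhead."""
    return num_vectors * dimensions * bytes_per_dim / 1e9


for label, count in (("10M", 10_000_000), ("100M", 100_000_000)):
    gb = index_memory_gb(count, 768, 2)  # 768-dim FP16 embeddings
    print(f"{label} vectors @ 768-dim FP16: ~{gb:.1f} GB")
```

At 768 dimensions in FP16, 10 million vectors come to roughly 15 GB and fit on a single 24 GB card, while 100 million come to roughly 154 GB and need compression or several GPUs.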
Some tips:
Here’s a decision table summing up everything related to sizing:
| Workload | Latency Goal | Dataset/Model Size | Recommended GPU Tier | Notes |
| Chat LLM (7B) | <200 ms p95 | 4k context | 24 GB | INT8 or FP16, KV-cache pinned |
| Chat LLM (13B) | <300 ms p95 | 8k context | 48 GB | Needs quantization or 2×24 GB |
| RAG Pipeline (13B + Vector) | <500 ms p95 | 100M vectors | 48–80 GB + CPU pool | GPU for LLM, hybrid for vectors |
| Batch Summarization (70B) | Latency relaxed | 1000 docs/hour | Multi-GPU 80 GB | Shard across nodes |
| Image Generation (512²) | <2 s p95 | Single images | 24 GB | Batching optional |
| Video Generation (1024²) | <10 s/clip | 10s @ 24fps | 80 GB+ multi-GPU | Frame chunking required |
| Vector Search (10M) | <10 ms p95 | 768-dim FP16 | 24 GB | Hot index only |
| Vector Search (100M+) | <50 ms p95 | 768-dim FP16 | Multi-GPU 48–80 GB | IVF-PQ compression |
Key takeaway: Most sizing errors come from underestimating VRAM overhead and overestimating achievable QPS. These formulas aren’t perfect, but they give you back-of-envelope accuracy close enough to avoid 10× overprovisioning mistakes. Use them as a first filter before experimenting with actual benchmarks.
Choosing the right GPU is about aligning VRAM, interconnects, and accelerator features with your workload. The wrong tier can leave you overspending on idle memory or bottlenecked by I/O.
VRAM is the gating factor for almost every GPU workload.
Rule of thumb to follow here: pick the smallest VRAM tier that can hold your workload comfortably with room for overhead.
UpCloud exposes GPUs via PCIe passthrough, which means you get dedicated access with no noisy neighbors. Unlike time-shared GPU slices, passthrough ensures predictable latency and throughput, crucial for production inference.
Caveats:
Modern GPUs include Tensor Cores, which are specialized units for accelerating matrix multiplies. They’re optimized for lower-precision math (FP16, FP8, INT8), delivering up to 10× speedups over FP32.
Practical impact:
When evaluating a GPU, check which precision modes its Tensor Cores accelerate, and make sure your framework supports them.
GPUs don’t work in isolation. Pairing them correctly matters:
When does multi-GPU pay off?
Cloud GPUs deliver unmatched acceleration, but their economics are unforgiving. The only way to control costs is to model them explicitly:
Here’s a general cost model you can use for reference:
Total Cost = (GPU $/hr × GPU-hours)
+ (CPU $/hr × CPU-hours)
+ (Storage $/GB × GB)
+ (Network $/GB × GB)
Cost per Unit = Total Cost ÷ Units Produced
Where Units = tokens, images, or queries served. GPU-hours scale with request rate, latency targets, and utilization efficiency (batch size, quantization, KV-cache reuse).
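The cost model translates into a couple of small functions. The sketch below uses made-up rates and volumes purely to show the arithmetic; none of the numbers are UpCloud prices, and the function names are illustrative.

```python
def total_cost(gpu_rate, gpu_hours, cpu_rate, cpu_hours,
               storage_rate, storage_gb, network_rate, network_gb):
    """Monthly total in dollars: one rate-times-quantity term per resource."""
    return (gpu_rate * gpu_hours
            + cpu_rate * cpu_hours
            + storage_rate * storage_gb
            + network_rate * network_gb)


def cost_per_unit(total, units_produced):
    """Dollars per token, image, or query."""
    return total / units_produced


# Hypothetical month: 500 GPU-hours, 2,000 vCPU-hours, 100 GB stored, 1 TB egress.
monthly = total_cost(2.0, 500, 0.10, 2000, 0.10, 100, 0.09, 1000)
print(f"Total: ${monthly:,.0f}")
print(f"Per unit at 1M units: ${cost_per_unit(monthly, 1_000_000):.4f}")
```

Keeping the two functions separate mirrors the model: the total is mostly set by provisioning choices, while the per-unit figure moves with how much work each GPU-hour produces.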
Using this model, let’s estimate bills of materials for a few different use cases.
(Note: Pricing numbers below are illustrative and not tied to current UpCloud catalog rates. Always check the latest GPU pricing in the UpCloud control panel.)
Let’s assume the workload to be: 7B LLM (INT8), ~100 RPS, 200 ms p95
Here’s what the BOM would look like on a very high level:
| Component | Usage / Qty | Unit Price | Monthly Cost | Notes |
| GPU (24–48 GB) | ~1,000 hrs | $2.5/hr | $2,500 | Quantization halves VRAM & cost |
| CPU Pool | 8–16 vCPUs, 730 hrs | $0.08/hr | $500 | Routing & orchestration |
| Storage | 100 GB | $0.10/GB | $10 | Checkpoints & snapshots |
| Network | ~2 TB egress | $0.09/GB | $180 | User traffic |
Work Produced: ~259B tokens/month
The rows sum to roughly $3,190 per month, so at the stated volume the cost comes out to ≈ $0.012 per million tokens (about $0.00000001 per token).
Here, let’s define the workload as: SDXL, 512² images, aiming for <2 s p95.
Here’s what the BOM would look like (again, on a very high level):
| Component | Usage / Qty | Unit Price | Monthly Cost | Notes |
| GPUs (2×48 GB) | ~1,500 hrs × 2 GPUs | $3/GPU-hr | $9,000 | Batch size biggest lever |
| CPU Pool | 32 vCPUs, 730 hrs | $0.10/hr | $2,300 | Pre/post-processing |
| Storage | 200 GB | $0.10/GB | $20 | Models + LoRA weights |
| Network | ~10 TB egress | $0.09/GB | $900 | CDN offload advised |
Work Produced: ~100K images/month
The final cost per image comes out to be ≈ $0.12/image. If you implement batching, the cost can drop to ~$0.06.
The workload for this case is: FAISS index for 1B embeddings @ 768d
Here’s what the BOM would look like:
| Component | Usage / Qty | Unit Price | Monthly Cost | Notes |
| GPU (80 GB) | ~50 hrs | $5/hr | $250 | Rented only for builds |
| CPU Pool | 32 vCPUs, 730 hrs | ~$0.047/hr | $1,100 | Query serving |
| Storage | 2 TB SSD | $0.10/GB | $200 | Embeddings + index |
| Network | ~1 TB egress | $0.09/GB | $90 | Query responses |
The rows total roughly $1,640 per month, so a cost of ≈ $0.16 per 1,000 queries corresponds to about 10 million queries per month.
In all three cases, the numerator (GPU, CPU, storage, network) is mostly fixed by instance choice and workload design. The denominator (tokens, images, queries) is where efficiency lives. Quantization, batch size, KV-cache reuse, and LoRA fine-tuning increase throughput per GPU-hour, slashing cost per unit.
The article so far has talked about concepts and walked through specific examples, but teams often need a more general guide to help decide on CPU vs GPU vs hybrid, GPU class, and quantization based on their requirements, such as model size, QPS, latency, workload type, budget, etc.
The decision matrix below aims to be a quick way to reason through hardware choices without wading into formulas every time.
| Inputs | Decision Output | Notes / References |
| Model ≤ 1B params, <10 RPS, p95 latency >500ms | CPU | Small models, low traffic, or cacheable jobs. CPUs can outperform GPUs at low throughput; GPU memory transfer overhead is unwarranted in such cases. |
| Model 7B–13B params, 20–50 RPS, p95 ≤200ms | Single GPU (24–48GB VRAM) | GPUs recommended for models above 7B; quantize to INT8/FP8 if VRAM is tight. CPU cannot keep up at this scale for prompt latency targets. |
| Model 13B–70B params, >50 RPS, p95 ≤200ms | Multi-GPU (NVLink preferred) | Multi-GPU setups with NVLink or fast PCIe preferred; pipeline/tensor parallelism is necessary. Pin KV-cache in VRAM for transformer inference speed. |
| Fine-tuning with LoRA/QLoRA | Single GPU (24–48GB VRAM) | These methods allow large models to be fine-tuned on consumer-grade GPUs via quantization and adaptation. |
| Full model fine-tuning (70B) | Multi-GPU cluster (80GB VRAM GPUs) | Full model fine-tuning requires sharding and robust checkpointing, typically run on high-VRAM clusters. |
| Diffusion / Image gen (≤512² res, batch ≤4) | Single GPU (24GB VRAM) | FP16 is standard for efficient performance and memory usage. INT8 is experimental but used for further speed. Steps/scheduler dominate latency, not precision. |
| Diffusion / Video gen (HD+, multi-frame) | Multi-GPU or high-VRAM GPU (48–80GB) | Video workloads spike memory per frame; multi-GPU or high-VRAM required with a job queue. |
| Vector DB: build a large ANN index | GPU (sidecar or dedicated) | Benchmarks show GPUs dramatically speed up HNSW/IVFPQ index build. |
| Vector DB: steady low-QPS search | CPU (primary) / Hybrid (GPU for hot paths) | Hybrid: CPU for lower QPS, GPU for hot paths; most QPS can be handled by the CPU. |
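For teams that want to wire the matrix into a planning script, the three LLM inference rows can be encoded as a small lookup. This is a simplified sketch: the function name is hypothetical, the thresholds are copied from the table, and any combination the table does not cover falls through to a reminder to benchmark.

```python
def recommend_llm_hardware(params_b, rps, p95_ms):
    """Map (model size in billions, requests/sec, p95 target in ms) to a tier.

    Only the three LLM inference rows of the matrix are encoded here.
    """
    if params_b <= 1 and rps < 10 and p95_ms > 500:
        return "CPU"
    if 13 <= params_b <= 70 and rps > 50 and p95_ms <= 200:
        return "Multi-GPU (NVLink preferred)"
    if 7 <= params_b <= 13 and 20 <= rps <= 50 and p95_ms <= 200:
        return "Single GPU (24-48 GB VRAM)"
    return "No direct match: benchmark before choosing"


print(recommend_llm_hardware(0.5, 5, 800))
print(recommend_llm_hardware(7, 30, 200))
print(recommend_llm_hardware(70, 100, 150))
```

The fall-through matters as much as the matches: the matrix leaves gaps (for example, 1B–7B models), and those cases are better settled by measurement than by interpolation.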
All the heuristics and matrices in the world don’t matter unless you can translate them into a reliable deployment. UpCloud’s dedicated GPU servers make this straightforward: GPUs are exposed via PCIe passthrough, paired with fast storage and private networking. That gives you predictable performance without noisy-neighbor effects, but it also means you need to think explicitly about scaling and orchestration.
These dedicated GPU instances use the same infrastructure stack as our standard compute VMs: fast NVMe-based MaxIOPS storage, private networks for cluster communication, and snapshot support for reproducible deployments.
Provision GPU-enabled servers in the UpCloud control panel or CLI (upctl server create). Attach them to your private network to keep inference traffic off the public internet, reducing both latency and egress charges. For reproducibility, you can snapshot your VM once the model runtime and weights are installed. This avoids slow re-deploys when scaling up.
For fine-tuning or vector index builds, you’ll need persistent volumes. Attach block storage to your GPU VMs for model checkpoints, then snapshot volumes regularly to protect against preemptions or crashes. For diffusion workloads, consider separate storage tiers: fast SSD for active weights, and object storage for archiving generated assets.
A common pattern is to run your core services (Postgres, vector DB, orchestration layer) on managed CPU servers, and treat GPUs as worker sidecars. You can queue jobs via Redis or a lightweight message bus, and then schedule onto GPU workers. This decouples orchestration from inference and makes autoscaling much easier.
For latency-bound APIs, autoscale GPU nodes on queue depth or request rate. For batch or offline jobs, scale on time windows (nightly index builds, retrains). If you use spot-priced instances for background jobs, wire checkpointing to persistent volumes so you don’t lose progress on preemption.
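Scaling on queue depth usually reduces to a simple target calculation. The sketch below shows one way to express it; the function, its parameters, and the default limits are hypothetical, and the actual provisioning call (control panel, CLI, or API) is deliberately left out.

```python
import math


def desired_gpu_workers(queue_depth, jobs_per_worker,
                        min_workers=1, max_workers=8):
    """Number of GPU workers needed to drain the queue, within fixed bounds.

    jobs_per_worker is how many queued jobs one worker can absorb while
    still meeting the latency target; it has to be measured per workload.
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queue_depth / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


for depth in (0, 45, 500):
    print(f"queue depth {depth}: {desired_gpu_workers(depth, 10)} workers")
```

The upper bound acts as a cost guardrail: a traffic spike grows the queue instead of the bill, which is often the safer failure mode for batch and background jobs.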
Cloud GPUs offer immense power, but they need to be applied with precision. Not every workload deserves a GPU, and even GPU-worthy jobs can waste budget if you overlook VRAM fit, quantization, or utilization. The key is to align GPU use with the actual needs of your workload and the cost constraints of your project.
This guide laid out sizing formulas, heuristics, and decision matrices to help you match models to the right GPU tier. We also walked through UpCloud-specific best practices (things like private networking, snapshots, GPU worker sidecars) that make real-world deployments more efficient. With the right framework, you can deploy LLMs, diffusion models, and vector databases in ways that balance performance and cost.
From there, monitoring and autoscaling keep your GPUs well utilized as demand shifts. UpCloud’s dedicated GPU servers, with their predictable performance, make it simple to experiment, measure, and scale without guesswork. The playbook is in your hands: build, measure, optimize, and scale with confidence.
Explore GPU-ready instances, private networking, and storage options in the UpCloud Control Panel to start deploying your next AI workload today.
Q: What types of AI workloads are best suited for cloud GPUs?
A: Cloud GPUs are ideal for compute-heavy AI workloads like large language models (LLMs), diffusion models for image and video generation, and video transcoding. These tasks benefit from the parallelism and high memory bandwidth that GPUs provide.
Q: How to choose the right cloud GPU instance for specific tasks like LLM inference or training diffusion models?
A: Match the GPU to your workload: use high-VRAM instances (24–80GB) for LLM inference and training, and high-throughput GPUs (like A100 or H100) for diffusion and video workloads. Always consider VRAM fit, precision (FP16/INT8), and batch size requirements.
Q: What are the cost implications and optimization strategies for running these workloads on cloud GPUs?
A: Cloud GPUs are costly, but you can optimize with quantization, mixed precision, and autoscaling. Right-sizing instances and using spot or reserved pricing can significantly cut costs.
Q: How to effectively scale AI models on cloud GPUs, including multi-GPU setups and auto-scaling?
A: Use distributed training frameworks like DeepSpeed or Horovod for multi-GPU scaling. Pair this with cloud-native auto-scaling policies to match GPU resources to fluctuating demand.
Q: What are the latency considerations when using cloud GPUs for real-time inference with LLMs or diffusion?
A: Latency is driven by VRAM fit, batch size, and GPU type. For real-time inference, prioritize low-latency GPUs and keep models quantized to fit entirely in memory.
Q: How to integrate vector databases with LLMs to improve retrieval-augmented generation (RAG)?
A: Connect vector databases like pgvector, Pinecone, or Weaviate with LLMs to retrieve relevant embeddings before generating responses. This enhances factual accuracy and context in RAG workflows.
Q: What are the practical use cases of vector databases in AI, including semantic search and recommendation systems?
A: Vector databases power semantic search, recommendation engines, anomaly detection, and personalization. They make AI outputs more relevant by efficiently handling similarity searches across embeddings.
Q: How to configure and deploy GPU-enabled services on cloud platforms like Google Cloud Run?
A: Use containerized applications with GPU-enabled base images and configure GPU allocation in the platform’s service settings. Ensure your container supports CUDA drivers and libraries.
Q: What are the best practices for monitoring and managing GPU resource utilization and performance?
A: Track GPU utilization, VRAM usage, and latency with tools like Prometheus, Grafana, or NVIDIA DCGM. Monitoring helps avoid bottlenecks and ensures efficient cost-performance trade-offs.
Q: What software and containerization strategies optimize running AI workloads on cloud GPUs?
A: Use lightweight containers with CUDA-optimized libraries, and adopt frameworks like PyTorch with mixed precision. Multi-stage Docker builds and orchestration with Kubernetes streamline deployment.