{"id":91,"date":"2025-10-02T16:14:30","date_gmt":"2025-10-02T13:14:30","guid":{"rendered":"https:\/\/upcloud.com\/global\/us\/2025\/10\/02\/beyond-pytorch-vs-tensorflow-2026\/"},"modified":"2025-10-02T16:14:30","modified_gmt":"2025-10-02T13:14:30","slug":"beyond-pytorch-vs-tensorflow-2026","status":"publish","type":"post","link":"https:\/\/upcloud.com\/global\/blog\/beyond-pytorch-vs-tensorflow-2026\/","title":{"rendered":"Beyond PyTorch vs. TensorFlow 2026"},"content":{"rendered":"\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Frontends \u00b7 Compilers \u00b7 Serving:<\/strong> <\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The modern AI stack is layered. You build and train in a frontend, you optimize execution with a compiler, and you expose models through a serving plane. Underneath sit <a href=\"https:\/\/upcloud.com\/global\/products\/gpu-servers\/\" target=\"_blank\" rel=\"noreferrer noopener\">GPUs<\/a>, <a href=\"https:\/\/upcloud.com\/global\/docs\/products\/networking\/\" target=\"_blank\" rel=\"noreferrer noopener\">networking<\/a>, <a href=\"https:\/\/upcloud.com\/global\/products\/block-storage\/\" target=\"_blank\" rel=\"noreferrer noopener\">storage<\/a>, and observability. These choices determine developer speed, cold start, throughput, and how you operate in production. As load grows, infrastructure matters as much as the stack itself, whether you run it in-house or on a provider like UpCloud.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-97e568f4b5b888c3a603e2a0c83920ca wp-block-paragraph\"><strong>How to use this guide<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Pick your frontend first. 
It sets the day-to-day developer experience.<\/li>\n\n\n\n<li>Add a compiler path for latency and cost.<\/li>\n\n\n\n<li>Choose a serving plane that fits your API shape and hardware.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-6c0ee8ed166fc8d9437dbeeb09d53874 wp-block-paragraph\"><strong>Keep layers straight<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Frontend = authoring and training.<\/li>\n\n\n\n<li>Compiler = graph capture and optimization.<\/li>\n\n\n\n<li>Serving = runtime that exposes an API.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Frontends: building and training<\/strong><\/h2>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-8e67aedfdbb4236f36186060f44a360b wp-block-paragraph\"><strong>PyTorch<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use when: rapid prototyping, custom layers, research workflows, unstructured data.<\/li>\n\n\n\n<li>Strengths: Pythonic API, dynamic computation graphs, straightforward debugging, strong ecosystem for vision and NLP.<\/li>\n\n\n\n<li>Trade-offs: more integration work to harden long-lived pipelines; production patterns vary by team.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-523750713145d62076d80f9a7e3b5e2c wp-block-paragraph\"><strong>TensorFlow<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use when: mature pipelines, device breadth, or deep alignment with Google Cloud and TPUs.<\/li>\n\n\n\n<li>Strengths: deployment tooling, TF Serving and TF Lite paths, good cross-device support, strong ecosystem for scale.<\/li>\n\n\n\n<li>Trade-offs: steeper learning curve for fast iteration; more ceremony when experimenting.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-65e30188b47d89fe15e5457762af3a3e wp-block-paragraph\"><strong>Keras 
3<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use when: portability across backends or teaching and knowledge transfer.<\/li>\n\n\n\n<li>Strengths: one high-level API that runs on PyTorch, TensorFlow, or JAX with minimal code changes; easier onboarding across teams.<\/li>\n\n\n\n<li>Trade-offs: less low-level control; performance and features are inherited from the selected backend.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-92d7d9f672e7396c973ad0a729618d30 wp-block-paragraph\"><strong>Quick picks<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If you need control and fast iteration, choose PyTorch.<\/li>\n\n\n\n<li>If you need enterprise pipelines or GCP and TPU integration, choose TensorFlow.<\/li>\n\n\n\n<li>If you need portability and long-term maintainability across backends, choose Keras 3.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Don\u2019t mix the layers.<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Before going further, here\u2019s a rule of thumb: framework \u2260 compiler \u2260 server. Think of it like this:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-04f0656f90e81477b4334f8cda7f4427\"><strong>Frontend (framework)<\/strong> is what you <em>code with<\/em>: PyTorch, TensorFlow, or Keras. This is where you build and train models.<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-5bb9a61c5ff3e642dfaf73eb9e76c9c3\"><strong>Compiler<\/strong> is what you use to <em>optimize<\/em> the model&#8217;s execution. 
torch.compile, torch.export, and XLA decide how your model runs under the hood, not what the model is.<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-8616dd7ef52ce932e88c20d32c3ff62c\"><strong>Serving stack<\/strong> is how you <em>deploy<\/em> that model to production. TF Serving, Triton\/TensorRT-LLM, and vLLM are runtime environments for inference.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/tf-serving1-4.png\" alt=\"-\" class=\"wp-image-65905\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Compiler layer: Speed, startup, and serialization<\/strong>.<\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Compilers determine performance, cold-start time, and portability.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>torch.compile: Opt\u2011in graph capture (Inductor backend by default)<\/strong>.<\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\"><code><mark class=\"has-inline-color has-vivid-purple-color\">torch.compile()<\/mark><\/code> wraps your module for JIT\u2011style graph capture via TorchDynamo and compiles with Inductor unless you choose another backend. It is a good fit when you want speed without restructuring your code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import torch\n\n# MyModel and x stand in for your own nn.Module and input batch\nmodel = MyModel().eval().cuda()\ncompiled = torch.compile(model)   # explicit; eager is still the global default\ny = compiled(x)                   # first call triggers compilation; later calls reuse it<\/code><\/pre>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Speedups depend on the model, and the first calls pay a compilation (warmup) cost. 
Unsupported ops or dynamic shapes may trigger recompiles or fall back to eager.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>torch.export + AOTInductor: Ahead-of-time for production<\/strong>.<\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Export to a stable graph, compile ahead of time, and package the result as a shared library you can load in Python or non-Python runtimes. This improves startup time and enables lean server processes.<\/p>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Important: <code>torch._inductor.aoti_compile_and_package<\/code> and <code>aoti_load_package<\/code> are prototype APIs with evolving behavior. Validate artifacts against your PyTorch minor version before promoting them to production, and expect to rebuild them when you upgrade.<a href=\"https:\/\/docs.pytorch.org\/tutorials\/recipes\/torch_export_aoti_python.html\" target=\"_blank\" rel=\"noreferrer noopener\"> PyTorch Documentation<\/a><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import torch\nfrom torch.export import export, Dim\n\n# model and dummy_input are assumed to be defined already.\n# The dynamic_shapes key (\"x\") must match the forward() argument name.\n# Dim takes min and max as keyword-only arguments.\nep = export(model, (dummy_input,),\n            dynamic_shapes={\"x\": {0: Dim(\"batch\", min=1, max=1024)}})\n\ntorch._inductor.aoti_compile_and_package(\n    ep, package_path=\"model.pt2\"\n)<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>XLA: Accelerated static graphs for TensorFlow and JAX<\/strong>.<\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">XLA is part of <strong>OpenXLA<\/strong> and powers compilation in TensorFlow and JAX, optimizing graphs across hardware. In TensorFlow you can enable it via <code><mark class=\"has-inline-color has-vivid-purple-color\">tf.function(jit_compile=True)<\/mark><\/code>. 
<a href=\"https:\/\/openxla.org\/xla\" target=\"_blank\" rel=\"noreferrer noopener\">OpenXLA Project<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Choosing a compiler path<\/strong>.<\/h3>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\" style=\"font-size:20px\"><table class=\"has-cyan-bluish-gray-color has-text-color has-link-color has-fixed-layout\"><tbody><tr><td><strong>Need<\/strong><\/td><td><strong>Pick<\/strong><\/td><\/tr><tr><td>Fast, dynamic workloads<\/td><td>torch.compile()<\/td><\/tr><tr><td>Production inference with low latency<\/td><td>torch.export + <strong>AOTInductor<\/strong><\/td><\/tr><tr><td>TensorFlow-heavy teams on TPUs<\/td><td>XLA (jit_compile=True)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Serving &amp; observability layer: Getting your model into the real world<\/strong>.<\/h2>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-ce8f00a8b30f43bfde88a6c0f4295f52 wp-block-paragraph\">Training a model is just half the journey. The next step is <strong><em>serving<\/em><\/strong>: making your model accessible so it can answer real-world requests, like image classifications, language completions, or predictions on live data.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-748b9e7923ac0f55d112ff0da2a7af52 wp-block-paragraph\">If you&#8217;re new to this:<br><strong>Model serving is how your trained model becomes an API<\/strong>, something a website, app, or other system can call to get results. 
Think of it as shipping your model into production.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-1b4b2494442a77f498a1bd75f7ec154e wp-block-paragraph\">If you&#8217;re already deploying:<br>You know that serving isn\u2019t just about speed. It\u2019s also about reliability, scale, and visibility into how your model performs over time. Here&#8217;s how the current serving stacks compare:<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>TF Serving: Mature, reliable, TensorFlow-native<\/strong>.<\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">TensorFlow Serving is the go-to for production TensorFlow models. It supports REST and gRPC APIs, model versioning, request batching, and Prometheus metrics, which you enable with a monitoring config file that also sets the scrape path.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">Prometheus metrics: scrape http:\/\/&lt;SERVER_IP&gt;:8501\/monitoring\/prometheus<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading has-black-color has-text-color has-link-color has-medium-font-size wp-elements-a71423ea039420c25eab940c5cd70f29\"><strong>Who it&#8217;s for:<\/strong> Teams already using TensorFlow, especially in larger systems where model versioning, rollback, and observability are critical.<\/h3>\n\n\n\n<h3 class=\"wp-block-heading has-black-color has-text-color has-link-color has-medium-font-size wp-elements-228d008d2d2b6313b2d4c116d9d770fa\"><strong>vLLM: Serve LLMs with an OpenAI-style API<\/strong>.<\/h3>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-78fa169e6039f379c20882bba79c0ead wp-block-paragraph\">vLLM is optimized for large language models (LLMs). 
It gives you an OpenAI-style interface with streaming responses and continuous batching for high throughput and low latency.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-black-color has-text-color has-link-color has-medium-font-size wp-elements-7353d35408fc9589e10ab529e3f3c951\"><strong>Who it&#8217;s for:<\/strong> Teams building chatbot-style interfaces, AI assistants, or anything that mimics OpenAI&#8217;s APIs without vendor lock-in.<\/h3>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Triton + TensorRT-LLM: high-performance inference<\/strong>.<\/h3>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-1f070306b16b9a3f515a8a090919cb66 wp-block-paragraph\">NVIDIA Triton provides multi-framework serving and a Prometheus metrics endpoint (default at <code>:8002\/metrics<\/code>). Pair it with TensorRT-LLM for peak GPU throughput. <a href=\"https:\/\/docs.nvidia.com\/deeplearning\/triton-inference-server\/archives\/triton_inference_server_1150\/user-guide\/docs\/metrics.html\" target=\"_blank\" rel=\"noreferrer noopener\">NVIDIA Docs<\/a><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"has-medium-font-size wp-block-paragraph\"><strong>Note on TorchServe:<\/strong> The repository was <strong>archived<\/strong> on <strong>Aug 7, 2025<\/strong> and marked \u201cLimited Maintenance.\u201d Do not adopt it for new systems.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Who it&#8217;s for:<\/strong> Advanced teams with large models and GPU-heavy workloads, especially when running on <a href=\"https:\/\/upcloud.com\/global\/products\/gpu-servers\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>UpCloud GPUs<\/strong><\/a> or similar infrastructure.<\/h3>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Why observability matters: Don\u2019t fly blind<\/strong><\/h3>\n\n\n\n<p 
class=\"has-medium-font-size wp-block-paragraph\">Whether you use TF Serving, Triton, or vLLM, you need to monitor what\u2019s happening in production:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-67a97b6a534c5f8ef12e81fc457b3ac5\">How many requests is the model getting?<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-23ac92bcaec6c9c689dffda332e5a7b0\">How fast is it responding?<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-3cf15693aadea2da9a239f20327b9e6f\">Is it failing silently?<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-a067e550976eb645b7049739087c3923\">Prometheus + Grafana give you answers. TF Serving, Triton, and vLLM expose Prometheus metrics. You can scrape data like latency, error rates, and queue sizes, then visualize them in dashboards that your team (and leadership) will care about.<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-56c192cb23415003a32b403d8b56fe2e\"><strong>New to this?<\/strong> You can follow <a href=\"https:\/\/upcloud.com\/global\/resources\/tutorials\/monitoring-upcloud-prometheus-part-1\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Prometheus series<\/a> and set up your first dashboard.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Where LLMOps Frameworks Fit<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Pick by priority: ease of use, portability, performance, or production. 
Scan the matrices and match the row to your use case.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-d1ac4b384c52e7e26734c48abc806242 wp-block-paragraph\"><strong>Quick picks:<\/strong> Ease: Keras 3, BentoML. Performance: Triton\/TensorRT-LLM; for LLMs, vLLM. Portability: KServe, BentoML, Ray Serve. Production: TF Serving, KServe, Triton\/TensorRT-LLM<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-47a2a5fcfe442c8f54456b71cec221bd wp-block-paragraph\"><strong>Legend:<\/strong> \u2714 great fit \u00b7 \u25b3 workable with effort \u00b7 \u2716 not ideal\/unsupported.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Dev &amp; Compile<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Use case \/ priority<\/strong><\/td><td><strong>Keras 3<\/strong><\/td><td><strong>PyTorch<\/strong><\/td><td><strong>TensorFlow<\/strong><\/td><td><strong>torch.compile \/ AOTI<\/strong><\/td><td><strong>XLA (TF)<\/strong><\/td><\/tr><tr><td>Beginner-friendly<\/td><td>\u2714 High<\/td><td>\u2714 Moderate<\/td><td>\u25b3 Steeper<\/td><td>\u2716 Advanced only<\/td><td>\u2716 Advanced only<\/td><\/tr><tr><td>Rapid prototyping<\/td><td>\u2714<\/td><td>\u2714<\/td><td>\u25b3 More verbose<\/td><td>\u25b3 Extra tuning<\/td><td>\u25b3 Compilation req<\/td><\/tr><tr><td>Production inference<\/td><td>\u25b3 Needs glue<\/td><td>\u2714 With setup<\/td><td>\u2714 Strong<\/td><td>\u2714 Faster start time<\/td><td>\u2714 Mature<\/td><\/tr><tr><td>LLM inference (chatbots etc.)<\/td><td>\u25b3 Not ideal<\/td><td>\u2714 via vLLM<\/td><td>\u25b3 via vLLM<\/td><td>\u25b3 Limited impact<\/td><td>\u25b3 Works w\/ XLA<\/td><\/tr><tr><td>Observability (Prometheus)<\/td><td>\u2716 Add-on<\/td><td>\u25b3 via serving stack (Triton\/vLLM)<\/td><td>\u2714 Built-in<\/td><td>\u25b3 Needs setup<\/td><td>\u2714 Built-in<\/td><\/tr><tr><td>Hardware optimization 
(GPU\/TPU)<\/td><td>\u25b3 Basic<\/td><td>\u2714 CUDA, AMP<\/td><td>\u2714 CUDA, TPU<\/td><td>\u2714 Fast startup<\/td><td>\u2714 XLA on TPU<\/td><\/tr><tr><td>Multi-framework portability<\/td><td>\u2714 Top choice<\/td><td>\u25b3 Code changes<\/td><td>\u25b3 Code changes<\/td><td>\u25b3 Some friction<\/td><td>\u2716 TF-only<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Serving &#8211; General<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Use case \/ priority<\/strong><\/td><td><strong>TorchServe (Legacy; archived Aug 2025)<\/strong><\/td><td><strong>TF Serving<\/strong><\/td><td><strong>BentoML<\/strong><\/td><td><strong>KServe<\/strong><\/td><\/tr><tr><td>Beginner-friendly<\/td><td>\u2716 Legacy\/archived<\/td><td>\u2714 Moderate<\/td><td>\u2714 High<\/td><td>\u25b3 Steeper<\/td><\/tr><tr><td>Rapid prototyping<\/td><td>\u2716<\/td><td>\u2714 Fast setup<\/td><td>\u2714 CLI quickstart<\/td><td>\u25b3 YAML + K8s<\/td><\/tr><tr><td>Production inference<\/td><td>\u2716 Use TF Serving or Triton<\/td><td>\u2714<\/td><td>\u2714 Strong packaging<\/td><td>\u2714 Mature on K8s<\/td><\/tr><tr><td>LLM inference (chatbots etc.)<\/td><td>\u2716<\/td><td>\u25b3 Basic support<\/td><td>\u25b3 via integrations<\/td><td>\u25b3 via runtimes<\/td><\/tr><tr><td>Observability (Prometheus)<\/td><td>\u2716<\/td><td>\u2714<\/td><td>\u2714 Built-in<\/td><td>\u2714 via Prometheus<\/td><\/tr><tr><td>Hardware optimization<\/td><td>\u2716<\/td><td>\u2714 GPU<\/td><td>\u25b3 Runner-dependent<\/td><td>\u25b3 Backend-dependent<\/td><\/tr><tr><td>Portability<\/td><td>\u2716<\/td><td>\u2716 TF-only<\/td><td>\u2714 Any runtime<\/td><td>\u2714 Multi-runtime<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Serving &#8211; LLM-focused<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table 
class=\"has-fixed-layout\"><tbody><tr><td><strong>Use case \/ priority<\/strong><\/td><td><strong>vLLM<\/strong><\/td><td><strong>Triton \/ TensorRT-LLM<\/strong><\/td><td><strong>Ray Serve<\/strong><\/td><td><strong>KServe<\/strong><\/td><\/tr><tr><td>Beginner-friendly<\/td><td>\u25b3 Dev-focused<\/td><td>\u2716 Not-beginner-ready<\/td><td>\u25b3 Requires Ray<\/td><td>\u25b3 Steeper<\/td><\/tr><tr><td>Rapid prototyping<\/td><td>\u2714 LLMs only<\/td><td>\u2716 Setup complexity<\/td><td>\u25b3 Cluster setup<\/td><td>\u25b3 YAML + K8s<\/td><\/tr><tr><td>Production inference<\/td><td>\u2714<\/td><td>\u2714 Best performance<\/td><td>\u2714 Scales on Ray<\/td><td>\u2714 Mature on K8s<\/td><\/tr><tr><td>LLM inference (chatbots etc.)<\/td><td>\u2714 Native<\/td><td>\u2714 Top-tier<\/td><td>\u2714 via vLLM<\/td><td>\u25b3 via runtimes<\/td><\/tr><tr><td>Observability (Prometheus)<\/td><td>\u25b3 Basic<\/td><td>\u2714 With setup<\/td><td>\u2714 Built-in<\/td><td>\u2714 via Prometheus<\/td><\/tr><tr><td>Hardware optimization<\/td><td>\u2714 Good<\/td><td>\u2714 Best on NVIDIA<\/td><td>\u25b3 Backend-dependent<\/td><td>\u25b3 Backend-dependent<\/td><\/tr><tr><td>Portability<\/td><td>\u2714 Python API<\/td><td>\u25b3 NVIDIA ecosystem<\/td><td>\u2714 Any backend<\/td><td>\u2714 Multi-runtime<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Notes:<\/strong> vLLM targets LLM serving. Triton\/TensorRT-LLM favors NVIDIA GPUs. KServe assumes Kubernetes. Ray Serve is an app-level scaler and often pairs with vLLM.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Run it on UpCloud: Your first GPU deployment<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">All the frameworks, compilers, and serving stacks we\u2019ve covered need solid infrastructure. 
That\u2019s where UpCloud GPUs come in: fast, reliable, and ready to run production-grade deep learning.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/tf-serving2-1-1.png\" alt=\"-\" class=\"wp-image-65864\" \/><\/figure>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Here\u2019s a quick path from model to deployed endpoint on UpCloud.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Step 1: Deploy a GPU server<\/strong><\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Spin up a GPU instance from the UpCloud control panel, or script it with <code>upctl<\/code>. The commands below pick the first L40S GPU plan available to your account and create a server in fi-hel2.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\"># Pick the first L40S GPU plan and create a server in fi-hel2\n# Requires: upctl logged in, jq installed\n\nZONE=\"fi-hel2\"\nPLAN=\"$(upctl server plans -o json \\\n  | jq -r '.[] | select(.name | test(\"^GPU-.*L40S$\")) | .name' \\\n  | head -n1)\"\n\nif [ -z \"$PLAN\" ]; then\n  echo \"No L40S GPU plans found in your account.\" &gt;&amp;2\n  exit 1\nfi\n\nupctl server create \\\n  --title gpu-server \\\n  --zone \"$ZONE\" \\\n  --plan \"$PLAN\" \\\n  --ssh-keys ~\/.ssh\/id_*.pub \\\n  --wait<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Step 2: Install your framework<\/strong><\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\"><strong>PyTorch:<\/strong> use the official selector to get the right command for your OS, Python, and CUDA. Do <strong>not<\/strong> hard-code an old wheel index.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Copy the generated command from <a href=\"https:\/\/pytorch.org\/get-started\" target=\"_blank\" rel=\"noreferrer noopener\">Get started &#8211; PyTorch. 
 <\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>TensorFlow:<\/strong> install the current release (2.20.0 as of Oct 2, 2025).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">pip install tensorflow==2.20.0<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Verify the GPU:<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">nvidia-smi<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Step 3: Serve your model<\/strong><\/h3>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Option A: vLLM (LLMs, OpenAI-compatible)<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">pip install vllm\nvllm serve meta-llama\/Llama-3.2-3B-Instruct --port 8000\n# API base: http:\/\/&lt;SERVER_IP&gt;:8000\/v1\n# Prometheus: http:\/\/&lt;SERVER_IP&gt;:8000\/metrics  (default server port: 8000)<\/code><\/pre>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">The server listens on port 8000 by default and exposes <code>\/metrics<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Option B: Triton + TensorRT-LLM (max throughput)<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\"># \/path\/to\/model_repository is a placeholder for your host model directory\ndocker run --gpus=all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \\\n  -v \/path\/to\/model_repository:\/models \\\n  nvcr.io\/nvidia\/tritonserver:24.08-py3 tritonserver --model-repository=\/models\n# HTTP :8000, gRPC :8001, Prometheus metrics :8002<\/code><\/pre>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">These are Triton\u2019s default ports; metrics are served at <code>:8002\/metrics<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Option C: TF Serving (TensorFlow models)<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">docker run -p 8501:8501 \\\n  -v \/models\/my_model:\/models\/my_model \\\n  -e MODEL_NAME=my_model -t tensorflow\/serving<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Step 4: Add observability<\/strong><\/h3>\n\n\n\n<p 
class=\"has-medium-font-size wp-block-paragraph\">Both <strong>vLLM<\/strong> and <strong>Triton<\/strong> expose Prometheus metrics and provide example dashboards you can import into Grafana.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Prometheus scrape examples<\/strong><\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">scrape_configs:\n  - job_name: vllm\n    static_configs:\n      - targets: ['&lt;SERVER_IP&gt;:8000']  # vLLM exposes \/metrics\n  - job_name: triton\n    static_configs:\n      - targets: ['&lt;SERVER_IP&gt;:8002']  # Triton \/metrics<\/code><\/pre>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">References for enabling and scraping metrics. <a href=\"https:\/\/deepwiki.com\/vllm-project\/production-stack\/6.1-metrics-and-dashboards\" target=\"_blank\" rel=\"noreferrer noopener\">DeepWiki<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Step 5: Scale<\/strong><\/h3>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Scale horizontally with Kubernetes on UpCloud or vertically by moving up GPU plans. As traffic grows, add request routing and autoscaling via KServe or Ray Serve.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Conclusion: building your 2026 AI stack<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">By 2026, torch.compile and AOTInductor artifacts, XLA compilation, and modern serving stacks like vLLM and Triton are standard practice. 
The question isn\u2019t PyTorch or TensorFlow; it\u2019s how your <strong>whole stack<\/strong> fits for cost, scale, and performance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\"><strong>Practical picks<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-d1c36e0de534e4a658eb30e59c47752b\"><strong>Portability:<\/strong> Keras 3 for one codebase across TF\/PyTorch\/JAX.<a href=\"https:\/\/github.com\/keras-team\/keras\" target=\"_blank\" rel=\"noreferrer noopener\"> GitHub<\/a><\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-460c84fa132caeff9159f5231f610a33\"><strong>DX &amp; prototyping:<\/strong> PyTorch.<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-03a567b0e583f48bf90f30c62d92f0b0\"><strong>Mature pipelines \/ GCP:<\/strong> TensorFlow.<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-0b8d53e315140fee412146731e52c485\"><strong>Compiler layer:<\/strong> pick torch.compile vs torch.export+AOTInductor vs XLA based on <strong>startup latency<\/strong> vs <strong>steady-state throughput<\/strong>. 
<em>(AOTInductor packaging APIs are prototype and may change.)<\/em><a href=\"https:\/\/docs.pytorch.org\/tutorials\/recipes\/torch_export_aoti_python.html\" target=\"_blank\" rel=\"noreferrer noopener\"> PyTorch Documentation<\/a><\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-ea6135f8b914074485014262f83c14f6\"><strong>Serving:<\/strong> choose TF Serving, vLLM, or Triton\/TensorRT-LLM based on API shape and GPU scheduling.<\/li>\n\n\n\n<li class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-c2fa53cffbb2244be56cb06890f3cf3d\"><strong>Observability:<\/strong> Prometheus\/Grafana using built-in exporters and dashboards.<a href=\"https:\/\/docs.vllm.ai\/en\/stable\/usage\/metrics.html\" target=\"_blank\" rel=\"noreferrer noopener\"> vLLM Documentation<\/a><\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size wp-block-paragraph\">Adopt the layered mindset. You\u2019ll avoid lock-in, optimize for today, and stay ready for what ships next.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>FAQ<\/strong><\/h2>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-f80299660af646243da83faff68c34f0 wp-block-paragraph\"><strong>1) Is PyTorch or TensorFlow faster in 2026?<br><\/strong> Comparable for most users. Compilers like torch.compile and XLA matter more than the logo. Enable them and measure.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-91e2cca0d813b10c15ddc6e72c76e41a wp-block-paragraph\"><strong>2) Should beginners learn PyTorch or TensorFlow first?<\/strong><strong><br><\/strong> Start with PyTorch for clarity and fast iteration. 
Use TensorFlow as your pipeline hardens.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-ab3485e7d6024732cdc125297cf54114 wp-block-paragraph\"><strong>3) Where does Keras 3 fit in?<br><\/strong> A multi-backend frontend that runs on TF, JAX, and PyTorch. Great for portability and teaching.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-34b076682149dc8f368e51a55705645a wp-block-paragraph\"><strong>4) How do I serve a PyTorch LLM in production?<\/strong><strong><br><\/strong> Use <strong>vLLM<\/strong> for an OpenAI-style API or <strong>Triton\/TensorRT-LLM<\/strong> for maximum throughput. Both expose Prometheus metrics for SLOs.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color wp-elements-e86c84b7adb85bca730bb8a25a1d4431 wp-block-paragraph\"><strong>5) Do I need a compiler like torch.compile or XLA?<br><\/strong> If you care about latency or cost, yes. They optimize execution. XLA is part of <strong>OpenXLA<\/strong> and powers TF and JAX.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-5b2c2e2de25f1c4b6cb5867265721018 wp-block-paragraph\"><strong>6) What about TorchServe?<br><\/strong> Do not start new projects with it. The repo was archived in Aug 2025 and marked limited maintenance.<\/p>\n\n\n\n<p class=\"has-black-color has-text-color has-link-color has-medium-font-size wp-elements-cdb3338eb388080f23e1fa885a8cc028 wp-block-paragraph\"><strong>7) Can I switch frameworks later?<br><\/strong> Keras 3 and ONNX improve portability, but custom ops and advanced layers add friction.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Frontends \u00b7 Compilers \u00b7 Serving: The modern AI stack is layered. 
You build and train in a frontend, you optimize execution with a compiler, and [&hellip;]<\/p>\n","protected":false},"author":19,"featured_media":65633,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_relevanssi_hide_post":"","_relevanssi_hide_content":"","_relevanssi_pin_for_all":"","_relevanssi_pin_keywords":"","_relevanssi_unpin_keywords":"","_relevanssi_related_keywords":"","_relevanssi_related_include_ids":"","_relevanssi_related_exclude_ids":"","_relevanssi_related_no_append":"","_relevanssi_related_not_related":"","_relevanssi_related_posts":"343,112,127,172,223,412","_relevanssi_noindex_reason":"Blocked by a filter function","footnotes":""},"categories":[28,76],"tags":[79],"class_list":["post-91","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-long-reads","category-gpus","tag-gpu-servers"],"acf":[],"_links":{"self":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/posts\/91","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/users\/19"}],"replies":[{"embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/comments?post=91"}],"version-history":[{"count":0,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/posts\/91\/revisions"}],"wp:attachment":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/media?parent=91"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/categories?post=91"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/tags?post=91"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}