{"id":1810,"date":"2025-11-03T21:13:07","date_gmt":"2025-11-03T19:13:07","guid":{"rendered":"https:\/\/upcloud.com\/global\/us\/resources\/tutorials\/deploying-an-open-source-data-platform-on-upcloud\/"},"modified":"2025-11-03T21:13:07","modified_gmt":"2025-11-03T19:13:07","slug":"deploying-an-open-source-data-platform-on-upcloud","status":"publish","type":"tutorial","link":"https:\/\/upcloud.com\/global\/resources\/tutorials\/deploying-an-open-source-data-platform-on-upcloud\/","title":{"rendered":"Deploying an open-source data platform on UpCloud"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Managing data at scale is now a baseline requirement. Teams must reliably ingest, store, transform, process, and serve data across many sources and consumers.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A modern data platform sits at the core of well-run companies. It drives market insight, shows the state of the business, and powers the AI systems teams rely on.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This tutorial is courtesy of Niels Claeys at <a href=\"https:\/\/www.dataminded.com\/\" target=\"_blank\" rel=\"noopener\">Dataminded<\/a>. It uses only open-source tools to deliver an end-to-end, production-ready solution. Consider it a practical blueprint for building a robust platform without vendor lock-in.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction to the project<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This tutorial walks you through setting up a cloud-agnostic, open-source data platform on UpCloud using <a href=\"https:\/\/opentofu.org\/\" target=\"_blank\" rel=\"noopener\">OpenTofu<\/a>. All the necessary code is available on <a href=\"https:\/\/github.com\/datamindedbe\/demo-upcloud-data-platform\" target=\"_blank\" rel=\"noopener\">GitHub<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Estimated deployment time: ~30\u201345 minutes (most of it waiting for the Kubernetes cluster, load balancer and managed database to be ready). 
Expected cost: ~10\u20acX\/day for the demo setup.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Architecture overview<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Before we start, here is a high-level overview of the architecture and components used in this demo data platform.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/image-282.png\" alt=\"-\" class=\"wp-image-68115\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The main components of the infrastructure are as follows:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/trino.io\/\" target=\"_blank\" rel=\"noopener\">Trino<\/a>: A distributed SQL engine for interactive queries across large and small datasets. It allows us to build a data warehouse on UpCloud without depending on a managed service.<\/li>\n\n\n\n<li><a href=\"https:\/\/docs.lakekeeper.io\/\" target=\"_blank\" rel=\"noopener\">Lakekeeper<\/a>: A production-ready metadata catalog for Iceberg tables, tightly integrated with Trino and OPA.<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/datamindedbe\/demo-upcloud-data-platform\/blob\/main\/docs\" target=\"_blank\" rel=\"noopener\">Open Policy Agent (OPA)<\/a>: A general-purpose policy engine used here to enforce fine-grained data access control.<\/li>\n\n\n\n<li><a href=\"https:\/\/traefik.io\/\" target=\"_blank\" rel=\"noopener\">Traefik<\/a>: A reverse proxy and ingress controller that manages SSL termination and routes traffic to the different services of our data platform.<\/li>\n\n\n\n<li><a href=\"https:\/\/zitadel.com\/\" target=\"_blank\" rel=\"noopener\">Zitadel<\/a>: An identity and access management platform that handles user and application authentication, with support for integration into your company\u2019s identity provider.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Prerequisites<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Before starting the deployment, make sure 
you have:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A verified <a href=\"https:\/\/upcloud.com\/global\/\">UpCloud account<\/a> with an API-enabled subaccount for creating resources.<\/li>\n\n\n\n<li>A hosted domain and DNS provider for assigning a subdomain to the data platform stack.<\/li>\n\n\n\n<li><a href=\"https:\/\/opentofu.org\/\" target=\"_blank\" rel=\"noopener\">OpenTofu<\/a>, <a href=\"https:\/\/kubernetes.io\/docs\/reference\/kubectl\/\" target=\"_blank\" rel=\"noopener\">kubectl<\/a>, and the AWS CLI (for the S3-compatible object storage backend) installed.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Before deploying any infrastructure, run a few quick sanity checks to confirm that the required tools are installed and to see which versions you have:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\"># Each command prints the installed version;\n# a \"command not found\" error means the tool is missing.\ntofu version\nkubectl version --client\naws --version<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Deploying the platform<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now you are ready to start deploying the platform on UpCloud.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Clone the repo and bootstrap OpenTofu state storage<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">git clone https:\/\/github.com\/datamindedbe\/demo-upcloud-data-platform.git\ncd demo-upcloud-data-platform<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As a first step, we need to create the OpenTofu state storage so that the OpenTofu state is stored remotely and can be used by multiple people at the same time.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Export your UpCloud 
user credentials in your current shell:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">export UPCLOUD_USERNAME=\"your_upcloud_username\"\nexport UPCLOUD_PASSWORD=\"your_upcloud_password\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Go to the bootstrap folder and initialize the backend, which creates an S3 bucket for the state files:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">cd infra\/bootstrap\ntofu init\ntofu apply<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">When successful, the outputs will print <em><strong>storage_bucket_name<\/strong><\/em> and <em><strong>storage_bucket_domain_name<\/strong><\/em>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Deploy the core infrastructure (Kubernetes, Postgres, Traefik, Zitadel)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we will create the core infrastructure that all the other platform applications depend on. All the infrastructure code is in the infra\/foundation folder.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create terraform.tfvars from the terraform.tfvars.example and update the variables as needed.<\/li>\n\n\n\n<li>To use the S3-compatible object storage as a backend for OpenTofu, you need to configure your AWS CLI with a profile named upcloud. 
You can find the necessary steps in the <a href=\"https:\/\/hub.upcloud.com\/object-storage\/2.0\">UpCloud object storage overview<\/a>, under S3 programmatic access for your object storage.<\/li>\n\n\n\n<li>Set up the Postgres database and Kubernetes cluster, which can take 10-15 minutes:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">tofu init -var-file=terraform.tfvars\ntofu apply -var-file=terraform.tfvars \\\n-target=upcloud_kubernetes_cluster.this \\\n-target=upcloud_kubernetes_node_group.default_group \\\n-target=upcloud_managed_database_postgresql.db<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">After the cluster is ready, you can inspect the nodes:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">export KUBECONFIG=$(pwd)\/.kubeconfig.yml\nkubectl get nodes<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Next up is Traefik, which will create a load balancer with a public IP address and expose our applications to the outside world.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">tofu apply -var-file=terraform.tfvars -target=module.traefik<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">After about 5 minutes, UpCloud will assign a public IP address to the load balancer. You can find it in the UpCloud console under Load Balancers -&gt; Services. To get DNS working, add an A record at your DNS provider that points to the load balancer IP address.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Finally, you can create the remaining foundation resources using:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">tofu apply -var-file=terraform.tfvars<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To validate that everything is working correctly, make sure that the Zitadel service is reachable at <strong>https:\/\/zitadel.&lt;your-domain&gt;<\/strong> (replace with the domain you configured). 
The login page looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/image-279-1024x644.png\" alt=\"-\" class=\"wp-image-68106\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">If this is not working, double-check your DNS configuration and check the Traefik and Zitadel pods in Kubernetes for any errors.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Create a Zitadel service user<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Before we can set up the remainder of our stack, we need to create a service user in Zitadel. This will allow us to configure authentication (using OAuth) for every application on our stack.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Go to the service users tab and create a new service user.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/image-280-1024x405.png\" alt=\"-\" class=\"wp-image-68108\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Make sure to assign the service user the Org Owner and IAM Owner roles.<\/li>\n\n\n\n<li>Create a key for the service user. 
Zitadel will create the JSON key that you need for the OpenTofu provider.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/upcloud.com\/media\/image-281.png\" alt=\"-\" class=\"wp-image-68111\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Put the key in the infra\/apps directory as a token.json file.<\/li>\n\n\n\n<li>Look up the organization ID in the Zitadel UI; you will need it in the next step.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Deploy the platform applications (Trino, Lakekeeper, OPA)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Now we are ready to deploy the applications that make up our data platform stack: OPA, Lakekeeper and Trino.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create terraform.tfvars from the terraform.tfvars.example and update the variables as needed.<\/li>\n\n\n\n<li>Apply all resources in the infra\/apps folder as follows:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">tofu init -var-file=terraform.tfvars\ntofu apply -var-file=terraform.tfvars<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting tips<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Trino won\u2019t start? Check kubectl get pods -n services -l app.kubernetes.io\/instance=trino and kubectl logs -n services &lt;pod-name&gt; for crash logs.<\/li>\n\n\n\n<li>DNS not resolving, or getting connection refused when browsing to one of the services? Make sure the A record with the Traefik load balancer IP has been added at your DNS provider.<\/li>\n\n\n\n<li>SSL certificate errors? Check the logs of the Traefik pod with kubectl logs -n traefik &lt;traefik-pod-name&gt;. Also double-check your load balancer configuration in UpCloud.<\/li>\n\n\n\n<li>Zitadel login fails? 
Verify that your DNS records and SSL certificates are configured correctly.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Using the data platform<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Now that the full stack is deployed, you can start using it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Configure your warehouse in Lakekeeper<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Before you can run SQL queries, you need to configure a Lakekeeper warehouse that points to the S3-compatible object storage.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To get the necessary information, run the following command in the infra\/foundation folder:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">tofu output s3_warehouse_info<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now go to the Lakekeeper UI at https:\/\/lakekeeper.&lt;your-domain&gt;\/ui and log in. From there, click Warehouses in the left menu, then click Create Warehouse. Fill in the form using the information retrieved from the previous command.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Running your first SQL queries<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You can now run your first SQL queries against Trino. 
You can use any SQL client that supports Trino, or use the Trino CLI.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Download the Trino CLI if you haven&#8217;t done this already:<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">curl -o trino https:\/\/repo1.maven.org\/maven2\/io\/trino\/trino-cli\/476\/trino-cli-476-executable.jar\nchmod a+x trino<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Connect to Trino using the CLI, running it from the directory you downloaded it to:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">.\/trino --server trino.&lt;your-domain&gt; --user &lt;zitadel-admin-email&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Execute queries inside Trino:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">-- List the available catalogs\nSHOW CATALOGS;\n\n-- List the schemas in the iceberg catalog\nSHOW SCHEMAS FROM iceberg;\n\n-- Create a demo schema and table, insert some data, and run a few queries\nCREATE SCHEMA IF NOT EXISTS iceberg.demo;\n\nCREATE TABLE IF NOT EXISTS iceberg.demo.events (\n  id BIGINT,\n  user_name VARCHAR,\n  event_type VARCHAR,\n  ts TIMESTAMP\n);\n\nINSERT INTO iceberg.demo.events (id, user_name, event_type, ts) VALUES\n  (1, 'alice', 'login', TIMESTAMP '2025-09-18 08:00:00'),\n  (2, 'bob',   'click', TIMESTAMP '2025-09-18 08:05:00');\n\nSELECT event_type, count(*) AS cnt FROM iceberg.demo.events GROUP BY event_type;\nSELECT * FROM iceberg.demo.events ORDER BY ts DESC LIMIT 10;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Congratulations! Your open-source data platform is now live on UpCloud. 
You can connect to Trino, create Iceberg tables, insert sample data, and query it using SQL.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Next steps<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">To make this stack production-ready, you will need to take care of the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Harden the OPA policies to enforce strict access control in Trino and Lakekeeper. At the moment we allow all actions, but we provide the necessary rego files to restrict access using Lakekeeper. For more information, check <a href=\"https:\/\/medium.com\/datamindedbe\/locking-down-your-data-fine-grained-data-access-on-eu-clouds-41e3d5108062\" target=\"_blank\" rel=\"noopener\">our blog on the topic<\/a> as well as the <a href=\"https:\/\/docs.lakekeeper.io\/docs\/latest\/opa\/\" target=\"_blank\" rel=\"noopener\">Lakekeeper OPA bridge<\/a> for the details.<\/li>\n\n\n\n<li>Enable Zitadel-based OIDC authentication for both users and inter-service communication.<\/li>\n\n\n\n<li>Enable autoscaling of your Kubernetes cluster by deploying the <a href=\"https:\/\/upcloud.com\/global\/docs\/guides\/cluster-autoscaler\/\">cluster-autoscaler<\/a>.<\/li>\n\n\n\n<li>Extend the stack with additional components such as Airflow for orchestration or HashiCorp Vault for secret management.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Cleaning up<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to tear down the platform, you can do it as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">cd infra\/apps\ntofu destroy -var-file=terraform.tfvars\n\ncd ..\/foundation\ntofu destroy -var-file=terraform.tfvars -target=module.traefik -target=module.zitadel\n# Now you can destroy the rest of the resources\ntofu destroy -var-file=terraform.tfvars\n\ncd ..\/bootstrap\ntofu destroy<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Support<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This article was written by Niels Claeys from 
Dataminded.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you have any questions or run into issues, feel free to open an issue in the GitHub repo or reach out to niels.claeys@dataminded.com or anyone else at Dataminded.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you want guidance on how to extend this stack or make it production-ready, you can reach out to the team at <a href=\"https:\/\/www.dataminded.com\/contact\" target=\"_blank\" rel=\"noopener\">Dataminded<\/a>.<\/p>\n","protected":false},"author":100,"featured_media":0,"comment_status":"open","ping_status":"closed","template":"","community-category":[226,229,232],"class_list":["post-1810","tutorial","type-tutorial","status-publish","hentry"],"acf":[],"_links":{"self":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/tutorial\/1810","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/tutorial"}],"about":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/types\/tutorial"}],"author":[{"embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/users\/100"}],"replies":[{"embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/comments?post=1810"}],"version-history":[{"count":0,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/tutorial\/1810\/revisions"}],"wp:attachment":[{"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/media?parent=1810"}],"wp:term":[{"taxonomy":"community-category","embeddable":true,"href":"https:\/\/upcloud.com\/global\/wp-json\/wp\/v2\/community-category?post=1810"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}