Suitebriar Blog

Written by Krunal Patel | Dec 19, 2025 1:59:59 PM

The Smart Engineer’s Guide to Moving Data from Google Cloud Storage to Google Drive

Navigating data transfers between cloud services often feels unnecessarily complex. A common challenge for data scientists and Google Workspace administrators is the “data silo” problem. Datasets reside in a Google Cloud Storage (GCS) bucket, often the result of bulk exports or engineering pipelines, while the stakeholders who need access operate entirely within Google Drive.

My name is Krunalkumar Patel, and I have been addressing this exact challenge since 2017, when Google Colaboratory (Colab) was first introduced. Even in its early days, Colab proved to be a practical way to copy data directly from GCS to Drive without routing files through a local machine.

The traditional approach of downloading data locally and re-uploading it to Drive is slow, bandwidth-intensive, and prone to failure at scale. Google Colab can serve as an authenticated, high-throughput bridge between these services. This guide outlines a reliable, production-aware method for transferring data, with a focus on handling large datasets, managing complex file paths, and avoiding common failure modes.

Why This Approach Is Better

Most tutorials suggest using cp (copy). We will use gsutil rsync (synchronize), which provides several key advantages:

  • Resumability: If a transfer is interrupted, re-running the same command skips files that already exist in the destination and picks up the remainder. (Individual partially written files are re-copied from scratch, not resumed.)

  • Efficiency: Only transfers files that have changed or do not exist in the destination.

  • Safety: Python’s shlex library ensures that paths containing spaces or special characters, such as the “Shared drives” prefix, are quoted correctly for the shell.
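For example, gsutil rsync supports a dry-run flag (-n) that previews exactly what would be copied without transferring anything. The short sketch below assumes you have already authenticated and mounted Drive (covered in Step 1) and uses placeholder bucket and folder names:

import shlex

# A destination path containing spaces, as is typical for Shared drives
dest = shlex.quote('/content/drive/Shared drives/Finance Team/Data/')

# -n = dry run: print what rsync *would* copy, but transfer nothing
!gsutil -m rsync -r -n gs://your-gcp-bucket-name/exports/ {dest}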

Prerequisites

Before running the code, ensure you have:

  1. A Google Cloud Project. Even though Colab is free, accessing GCS buckets typically requires a project with billing enabled. Transfers within the same multi-region are generally low cost.

  2. Permissions:

    • GCS: Your account needs Storage Object Viewer permissions on the source bucket.

    • Drive: Your account needs Editor access to the destination folder.

  3. Colab Disk Limits ("80GB Rule"): Colab uses temporary local storage. Standard VMs have about 80 to 100GB available. For safety, avoid transferring more than approximately 50GB per batch or split transfers into subfolders to prevent runtime crashes.
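Before a large run, it is worth checking how much scratch space your current runtime actually has. A minimal check using Python's standard library (assuming the default /content working directory):

import shutil

# Free space on the Colab VM's local disk. The Drive mount buffers writes here
# before syncing, so large transfers consume this space temporarily.
total, used, free = shutil.disk_usage('/content')
print(f"Total: {total / 1e9:.1f} GB | Free: {free / 1e9:.1f} GB")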

Step-by-Step Guide

Step 1: Open Colab and Authenticate

Open a new notebook at colab.research.google.com.

from google.colab import auth
from google.colab import drive
import os
import shlex


# 1. Authenticate with Google Cloud
print("Authenticating User...")
auth.authenticate_user()

# 2. Mount Google Drive
# This maps your Drive to the local path '/content/drive'
print("Mounting Google Drive...")
drive.mount('/content/drive')

This step authorizes Colab to access your GCS buckets and Google Drive. A pop-up may ask you to grant permissions.
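Optionally, you can confirm the credentials work by listing the top level of your source bucket before going any further. This is just a sanity check, shown here with a placeholder bucket name:

# Should print the top-level objects/folders in the bucket.
# A 401/403 error here means authentication failed or your account
# lacks Storage Object Viewer on the bucket.
!gsutil ls gs://your-gcp-bucket-name/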

Step 2: Configure Your Paths

Define source and destination paths. Use shlex.quote() to handle spaces or special characters.

# --- CONFIGURATION ---
# The name of your GCS bucket (without gs://)
# Example: 'company-data-exports'
BUCKET_NAME = 'your-gcp-bucket-name'
# The specific folder inside the bucket (optional).
# Leave empty '' if you want the whole bucket.
SOURCE_FOLDER = 'exports/2024/june/'

# The destination path in Google Drive.
# Tip: Navigate to the folder in the Colab sidebar (left),
# right-click, and select "Copy path".
# Example: '/content/drive/Shared drives/Finance Team/Data/'
DESTINATION_PATH_RAW = '/content/drive/Shared drives/Your Shared Drive/Target Folder/'

# --- PREPARATION ---

# 1. Construct the Source URI
source_uri = f"gs://{BUCKET_NAME}/{SOURCE_FOLDER}"

# 2. Create Destination Directory if it doesn't exist
if not os.path.exists(DESTINATION_PATH_RAW):
    os.makedirs(DESTINATION_PATH_RAW)
    print(f"Created new directory: {DESTINATION_PATH_RAW}")

# 3. Sanitize the Destination Path for the Shell
# This adds quotes and escapes spaces automatically
safe_destination = shlex.quote(DESTINATION_PATH_RAW)

print(f"Source: {source_uri}")
print(f"Destination: {safe_destination}")

Step 3: Execute the Transfer with rsync

# --- EXECUTION ---

# Construct the command
# -m : Multi-threaded (faster)
# rsync : Synchronize (resumable, smarter than cp)
# -r : Recursive (includes subfolders)
command = f"gsutil -m rsync -r {source_uri} {safe_destination}"

print("Starting transfer... Please keep this tab open.")
print(f"Executing: {command}\n")

# Run the command using the magic '!' operator
!{command}

print("\Transfer process finished.")

Step 4: Flush and Verify

Colab buffers writes to Drive. To ensure all files are synced:

# Force the Drive to sync any buffered data
print("Flushing changes to Drive...")
drive.flush_and_unmount()
print("Success! Drive unmounted and data synced.")

Pro Tips for Power Users

  • Small File Transfers. Thousands of tiny files such as logs or images can be slow to transfer due to Drive metadata overhead. Zip small files first to improve speed (see the sketch after this list).

  • Large Transfers ("Hairpin Traffic"). Data flows from GCS to the Colab VM to Drive, using temporary VM disk space. Split transfers over 50GB into subfolders. For massive datasets, consider GCS Transfer Service.

  • Automation. Colab free sessions disconnect after approximately 90 minutes of inactivity and have a maximum session length of about 12 hours. For daily automated transfers, use Cloud Functions or Cloud Run instead.
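If you do need to move thousands of tiny files, one workable pattern is to copy them from GCS to the Colab VM's local disk first, zip them there, and move only the single archive into Drive. A minimal sketch of that pattern, with placeholder paths and bucket name:

import os
import shutil

# 1. Stage the small files on the VM's fast local disk (not on the Drive mount)
os.makedirs('/content/small_files', exist_ok=True)
!gsutil -m cp -r gs://your-gcp-bucket-name/logs/* /content/small_files/

# 2. Zip them locally; a single archive avoids per-file Drive metadata overhead
archive = shutil.make_archive('/content/small_files_archive', 'zip', '/content/small_files')

# 3. Move just the one archive into the mounted Drive folder
shutil.move(archive, '/content/drive/Shared drives/Your Shared Drive/Target Folder/')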

Troubleshooting Common Errors

  • Destination URL must match exactly 1 URL. This usually occurs due to unquoted paths. Using shlex.quote() fixes this.

  • 401 Anonymous Caller. You skipped the authentication step or lack sufficient GCS permissions. Re-run auth.authenticate_user().

  • Transport endpoint is not connected. Drive connection crashed. Restart the runtime and remount Drive.
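For that last error, it is sometimes enough to force a remount before resorting to a full runtime restart; force_remount is a standard parameter of drive.mount:

from google.colab import drive

# Re-establish the Drive connection after a "Transport endpoint is not connected" error
drive.mount('/content/drive', force_remount=True)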

Full Copy-Paste Script

import shlex
import os

from google.colab import auth, drive

# --- CONFIGURATION ---
BUCKET_NAME = 'your-bucket-name-here'

# Note: gsutil rsync expects a folder path, not a wildcard like 'folder/*'
GCP_SOURCE_FOLDER = 'folder/subfolder/'
DRIVE_DESTINATION_RAW = '/content/drive/Shared drives/Your Shared Drive/Target Folder/'
# --- END CONFIGURATION ---

def main():
    print("Step 1: Authenticating")
    auth.authenticate_user()

    print("Step 2: Mounting Drive")
    if not os.path.exists('/content/drive'):
        drive.mount('/content/drive')
    else:
        print("Drive already mounted.")

    # Ensure the destination folder exists before syncing into it
    os.makedirs(DRIVE_DESTINATION_RAW, exist_ok=True)

    source = f"gs://{BUCKET_NAME}/{GCP_SOURCE_FOLDER}"
    dest = shlex.quote(DRIVE_DESTINATION_RAW)

    print(f"Source: {source}")
    print(f"Destination: {dest}")

    print("Step 3: Copying Files")
    command = f"gsutil -m rsync -r {source} {dest}"
    exit_code = os.system(command)

    if exit_code == 0:
        print("SUCCESS. Transfer finished.")
    else:
        print("ERROR. Check the logs above.")

    print("Step 4: Flushing Changes")
    drive.flush_and_unmount()
    print("Drive unmounted and data synced.")

if __name__ == "__main__":
    main()

By using gsutil rsync and proper path handling, you turn a fragile script into a robust, production-ready tool. This method avoids common pitfalls with large datasets and tricky folder names, making Colab a practical intermediary between Google Cloud Storage and Google Drive.