How to Read CSV Files and Store Them in Delta Lake Tables Using Databricks
In this blog post, we'll explore how to read both single and multiple CSV files into a DataFrame and then store them in a Delta Lake table using Databricks. This process is essential for building efficient and reliable data pipelines.
Prerequisites
Before we begin, ensure you have the following:
- An active Databricks account
- Basic knowledge of PySpark
- CSV files to work with
Step 1: Setting Up Your Environment
First, you'll need to set up your Databricks environment. Create a new notebook in Databricks and attach it to a cluster.
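Before reading anything, it can help to confirm that the Spark session is available and that your source files are visible to the cluster. A minimal sketch, assuming your CSV files sit in DBFS or mounted cloud storage; the path below is a placeholder:
# The Databricks notebook provides the "spark" session automatically
print(spark.version)
# List the files at the source location to confirm they are visible (placeholder path)
display(dbutils.fs.ls("/path/to/your/csvfiles/"))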
Step 2: Reading Single and Multiple CSV Files
Reading a Single CSV File
To read a single CSV file, you can use the following PySpark code:
# Read a single CSV file into a DataFrame
df_single = spark.read.format("csv").option("header", "true").load("/path/to/your/singlefile.csv")
# Display the DataFrame
df_single.show()
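By default, every column is read as a string. If you want Spark to infer column types, or you prefer to declare the schema explicitly for a more predictable pipeline, here is a sketch; the column names and types are placeholders:
# Let Spark infer column types (requires an extra pass over the data)
df_single_typed = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/path/to/your/singlefile.csv")
)
# Or declare the schema explicitly -- column names and types below are placeholders
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df_single_schema = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(schema)
    .load("/path/to/your/singlefile.csv")
)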
Reading Multiple CSV Files
To read multiple CSV files, you can use a wildcard in the file path:
# Read multiple CSV files into a DataFrame
df_multiple = spark.read.format("csv").option("header", "true").load("/path/to/your/csvfiles/*.csv")
# Display the DataFrame
df_multiple.show()
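A wildcard works when all the files sit in one folder. If the files live in different locations, you can also pass an explicit list of paths; the paths below are placeholders:
# Read an explicit list of CSV files into a single DataFrame
paths = [
    "/path/to/your/csvfiles/file1.csv",
    "/path/to/your/csvfiles/file2.csv",
]
df_multiple_list = spark.read.csv(paths, header=True)
df_multiple_list.show()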
Step 3: Writing DataFrames to Delta Lake Tables
Once you have your DataFrames, you can write them to Delta Lake tables. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
Writing a Single DataFrame to Delta Lake
# Write the single DataFrame to a Delta Lake table
df_single.write.format("delta").save("/path/to/delta/table/single")
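Re-running the command above will fail if the target path already contains a table. For repeatable runs you can set a write mode; a sketch using the same placeholder path:
# Overwrite the Delta table on each run (use "append" to add new rows instead)
df_single.write.format("delta").mode("overwrite").save("/path/to/delta/table/single")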
Writing Multiple DataFrames to Delta Lake
# Write the DataFrame built from multiple CSV files to a Delta Lake table
df_multiple.write.format("delta").save("/path/to/delta/table/multiple")
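Saving to a path creates an external Delta table. If you would rather register the data in the metastore so it can be queried by name in SQL, saveAsTable is one option. A sketch, assuming the target database already exists; the table name is a placeholder:
# Register the DataFrame as a managed Delta table in the metastore (placeholder table name)
df_multiple.write.format("delta").mode("overwrite").saveAsTable("my_database.csv_data_multiple")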
Step 4: Verifying the Delta Lake Tables
To ensure that your data has been written correctly, you can read the Delta Lake tables back into DataFrames and display them:
# Read the Delta Lake table into a DataFrame
df_single_delta = spark.read.format("delta").load("/path/to/delta/table/single")
df_multiple_delta = spark.read.format("delta").load("/path/to/delta/table/multiple")
# Display the DataFrames
df_single_delta.show()
df_multiple_delta.show()
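Beyond eyeballing the rows, you can compare row counts against the source DataFrames and inspect the transaction history that Delta Lake records for every write. The path below is the same placeholder used earlier:
# Compare row counts between the source DataFrames and the Delta tables
print(df_single.count(), df_single_delta.count())
print(df_multiple.count(), df_multiple_delta.count())
# Inspect the table history Delta Lake keeps for each write (placeholder path)
display(spark.sql("DESCRIBE HISTORY delta.`/path/to/delta/table/single`"))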
Conclusion
By following these steps, you can efficiently read CSV files and store them in Delta Lake tables using Databricks. This approach ensures that your data pipelines are robust and scalable, leveraging the powerful features of Delta Lake.
- Submitted By Vibhuti Singh
- Category databricks
- Created On 23-Aug-2024