How to Read CSV Files and Store Them in Delta Lake Tables Using Databricks
In this blog post, we'll explore how to read both single and multiple CSV files into a DataFrame and then store them in a Delta Lake table using Databricks. This process is essential for building efficient and reliable data pipelines.
Prerequisites
Before we begin, ensure you have the following:
- An active Databricks account
- Basic knowledge of PySpark
- CSV files to work with
Step 1: Setting Up Your Environment
First, you'll need to set up your Databricks environment. Create a new notebook in Databricks and attach it to a cluster.
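Before reading anything, it can help to confirm that the Spark session is available and that your source files are visible to the cluster. A minimal sketch, assuming your CSV files sit in DBFS or mounted cloud storage; the path below is a placeholder:
# The Databricks notebook provides the "spark" session automatically
print(spark.version)
# List the files at the source location to confirm they are visible (placeholder path)
display(dbutils.fs.ls("/path/to/your/csvfiles/"))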
Step 2: Reading Single and Multiple CSV Files
Reading a Single CSV File
To read a single CSV file, you can use the following PySpark code:
# Read a single CSV file into a DataFrame
df_single = spark.read.format("csv").option("header", "true").load("/path/to/your/singlefile.csv")
# Display the DataFrame
df_single.show()
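By default, every column is read as a string. If you want Spark to infer column types, or you prefer to declare the schema explicitly for a more predictable pipeline, here is a sketch; the column names and types are placeholders:
# Let Spark infer column types (requires an extra pass over the data)
df_single_typed = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/path/to/your/singlefile.csv")
)
# Or declare the schema explicitly -- column names and types below are placeholders
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df_single_schema = (
    spark.read.format("csv")
    .option("header", "true")
    .schema(schema)
    .load("/path/to/your/singlefile.csv")
)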
Reading Multiple CSV Files
To read multiple CSV files, you can use a wildcard in the file path:
# Read multiple CSV files into a DataFrame
df_multiple = spark.read.format("csv").option("header", "true").load("/path/to/your/csvfiles/*.csv")
# Display the DataFrame
df_multiple.show()
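A wildcard works when all the files sit in one folder. If the files live in different locations, you can also pass an explicit list of paths; the paths below are placeholders:
# Read an explicit list of CSV files into a single DataFrame
paths = [
    "/path/to/your/csvfiles/file1.csv",
    "/path/to/your/csvfiles/file2.csv",
]
df_multiple_list = spark.read.csv(paths, header=True)
df_multiple_list.show()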
Step 3: Writing DataFrames to Delta Lake Tables
Once you have your DataFrames, you can write them to Delta Lake tables. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
Writing a Single DataFrame to Delta Lake
# Write the single DataFrame to a Delta Lake table
df_single.write.format("delta").save("/path/to/delta/table/single")
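Re-running the command above will fail if the target path already contains a table. For repeatable runs you can set a write mode; a sketch using the same placeholder path:
# Overwrite the Delta table on each run (use "append" to add new rows instead)
df_single.write.format("delta").mode("overwrite").save("/path/to/delta/table/single")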
Writing Multiple DataFrames to Delta Lake
# Write the DataFrame built from multiple CSV files to a Delta Lake table
df_multiple.write.format("delta").save("/path/to/delta/table/multiple")
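Saving to a path creates an external Delta table. If you would rather register the data in the metastore so it can be queried by name in SQL, saveAsTable is one option. A sketch, assuming the target database already exists; the table name is a placeholder:
# Register the DataFrame as a managed Delta table in the metastore (placeholder table name)
df_multiple.write.format("delta").mode("overwrite").saveAsTable("my_database.csv_data_multiple")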
Step 4: Verifying the Delta Lake Tables
To ensure that your data has been written correctly, you can read the Delta Lake tables back into DataFrames and display them:
# Read the Delta Lake table into a DataFrame
df_single_delta = spark.read.format("delta").load("/path/to/delta/table/single")
df_multiple_delta = spark.read.format("delta").load("/path/to/delta/table/multiple")
# Display the DataFrames
df_single_delta.show()
df_multiple_delta.show()
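Beyond eyeballing the rows, you can compare row counts against the source DataFrames and inspect the transaction history that Delta Lake records for every write. The path below is the same placeholder used earlier:
# Compare row counts between the source DataFrames and the Delta tables
print(df_single.count(), df_single_delta.count())
print(df_multiple.count(), df_multiple_delta.count())
# Inspect the table history Delta Lake keeps for each write (placeholder path)
display(spark.sql("DESCRIBE HISTORY delta.`/path/to/delta/table/single`"))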
Conclusion
By following these steps, you can efficiently read CSV files and store them in Delta Lake tables using Databricks. This approach ensures that your data pipelines are robust and scalable, leveraging the powerful features of Delta Lake.
- Submitted By Vibhuti Singh
- Category databricks
- Created On 23-Aug-2024