且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

如何读取多个Excel文件并将它们连接成一个Apache Spark DataFrame?

更新时间:2023-11-21 23:37:52

为此,我使用了 spark-excel 软件包。它可以添加到build.sbt文件: libraryDependencies + =com.crealytics%%spark-excel%0.8.2



在IntelliJ IDEA Scala Console中执行的代码是:

  import org.apache。 spark。Spark Spark,SparkContext} 
import org.apache.spark.sql。{SparkSession,DataFrame}
import java.io.File

val conf = new SparkConf() .setAppName(Excel to DataFrame)。setMaster(local [*])
val sc = new SparkContext(conf)
sc.setLogLevel(WARN)

val spark = SparkSession.builder()。getOrCreate()

//使用spark-excel读取xlsx文件的功能。
//具有尾随点的代码格式可以作为块发送到IJ Scala Console。
def readExcel(file:String):DataFrame = spark.read。
格式(com.crealytics.spark.excel)。
选项(位置,文件)。
选项(useHeader,true)。
选项(treatEmptyValuesAsNulls,true)。
选项(inferSchema,true)。
选项(addColorColumns,False)。
load()

val dir = new File(./ data / CCPP / sheets)
val excelFiles = dir.listFiles.sorted.map(f => f.toString)// Array [String]

val dfs = excelFiles.map(f => readExcel(f))// Array [DataFrame]
val ppdf = dfs.reduce (_.union(_))// DataFrame

ppdf.count()// res3:Long = 47840
ppdf.show(5)

控制台输出:

  --- + ----- + ------- + ----- + ------ + 
| AT | V | AP | RH | PE |
+ ----- + ----- + ------- + ----- + ------ +
| 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |
+ ----- + ----- + ------- + ----- + ------ +
只显示前5行


Recently I wanted to do Spark Machine Learning Lab from Spark Summit 2016. Training video is here and exported notebook is available here.

The dataset used in the lab can be downloaded from UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is xlsx file with five sheets.

To use the data in the lab I needed to read all the sheets form the Excel file and to concatenate them into one Spark DataFrame. During the training they are using Databricks Notebook but I was using IntelliJ IDEA with Scala and evaluating the code in the console.

The first step was to save all the Excel sheets into separate xlsx files named sheet1.xlxs, sheet2.xlsx etc. and put them into sheets directory.

How to read all the Excel files and concatenate them into one Apache Spark DataFrame?

For this I have used spark-excel package. It can be added to build.sbt file as : libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"

The code to execute in IntelliJ IDEA Scala Console was:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val spark = SparkSession.builder().getOrCreate()

// Function to read xlsx file using spark-excel. 
// This code format with "trailing dots" can be sent to IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "False").
  load()

val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString)  // Array[String]

val dfs = excelFiles.map(f => readExcel(f))  // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_))  // DataFrame 

ppdf.count()  // res3: Long = 47840
ppdf.show(5)

Console output:

+-----+-----+-------+-----+------+
|   AT|    V|     AP|   RH|    PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows