



Update: as of Spark 2.0, you can simply use the built-in csv data source:

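import org.apache.spark.sql.SparkSession
// In spark-shell a session already exists as `spark`; otherwise create one:
val spark: SparkSession = SparkSession.builder().getOrCreate()
val df = spark.read.csv("file.txt")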


You can also use various options to control the CSV parsing, e.g.:

val df = spark.read.option("header", "false").csv("file.txt")
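
For a semicolon-delimited file, several options might be combined as in the sketch below. It assumes Spark 2.0 or later on the classpath and a local master; the sample data, the temp file, and the names `path` and `people` are made up for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session, just for trying this out.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-options-sketch")
  .getOrCreate()

// Hypothetical sample data written to a temp file so the example is self-contained.
val path = Files.createTempFile("people", ".csv")
Files.write(path, "id;name\n1;alice\n2;bob\n".getBytes(StandardCharsets.UTF_8))

val people = spark.read
  .option("sep", ";")            // field delimiter (the default is ",")
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // costs an extra pass over the data
  .csv(path.toString)

people.printSchema()
```

Without `inferSchema`, every column is read as a string, so it is only worth the extra pass when typed columns are needed.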

For Spark versions before 2.0: the easiest way is to use spark-csv. Include it in your dependencies and follow the README. It allows setting a custom delimiter (;), can read CSV headers (if you have them), and can infer the schema types (at the cost of an extra scan of the data).
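
With spark-csv among the dependencies, the read might look like the sketch below. It assumes Spark 1.4 or later with an existing `SQLContext`; the helper name `readSemicolonCsv` is hypothetical, while the format name and option keys are the ones given in the spark-csv README.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical helper; assumes the com.databricks:spark-csv package is on the classpath.
def readSemicolonCsv(sqlContext: SQLContext, path: String): DataFrame =
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("delimiter", ";")      // custom field delimiter
    .option("header", "true")      // use the first line as column names
    .option("inferSchema", "true") // extra scan of the data to guess column types
    .load(path)
```

Calling `readSemicolonCsv(sqlContext, "file.txt")` would then return a DataFrame whose columns are named after the header line.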


Alternatively, if you know the schema, you can create a case class that represents it and map your RDD elements into instances of this class before converting to a DataFrame, e.g.:

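import spark.implicits._ // needed for toDF(); on Spark 1.x use sqlContext.implicits._

case class Record(id: Int, name: String)

// myFile is assumed to be an RDD[String] holding the lines of the file
val myFile1 = myFile.map(x => x.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
}

myFile1.toDF() // DataFrame will have columns "id" and "name"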
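
Note that the pattern match above fails with a `MatchError` on any line that does not split into exactly two fields, and `toInt` throws on a non-numeric id. A more defensive parse could look like the sketch below; `parseRecord` and the sample lines are made up for illustration, and a plain `Seq` stands in for the RDD so it runs without Spark.

```scala
import scala.util.Try

case class Record(id: Int, name: String)

// Returns None for lines that do not have exactly two fields
// or whose id is not a valid integer.
def parseRecord(line: String): Option[Record] =
  line.split(";", -1) match {
    case Array(id, name) => Try(Record(id.trim.toInt, name)).toOption
    case _               => None
  }

// Hypothetical input; the last two lines are malformed on purpose.
val lines = Seq("1;alice", "2;bob", "oops", "x;carol")

// flatMap drops the None results and unwraps the rest.
val records = lines.flatMap(line => parseRecord(line))
```

With Spark, the same function can be passed to `flatMap` on the RDD before calling `toDF()`, which silently skips bad lines. Whether skipping or failing fast is preferable depends on how much you trust the input.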