Spark SQL RDDs

Updated 2018-11-26 16:33

RDDs

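Spark supports two methods for converting existing RDDs into SchemaRDDs. The first method uses reflection to infer the schema of an RDD that contains objects of a specific type. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating SchemaRDDs is a programmatic interface that lets you construct a schema and then apply it to an existing RDD. Although this method is more verbose, it lets you construct SchemaRDDs when the columns and their types are not known until run time.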

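Inferring the Schema Using Reflection

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes into a SchemaRDD. The case class defines the schema of the table.

The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can be nested or contain complex types such as sequences or arrays. The RDD can be implicitly converted to a SchemaRDD and then registered as a table. The table can be used in subsequent SQL statements.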
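
To make the idea of "parameter names become column names" concrete, here is a minimal, Spark-free sketch. It is not how Spark SQL itself does the lookup; it only uses plain Java reflection to list the fields of a case class. The `Address` and `Customer` classes and the `ColumnNames` helper are hypothetical names introduced for illustration.

```scala
// Hypothetical nested case classes; Customer contains another case class
// and a sequence, the kinds of complex types mentioned above.
case class Address(city: String, zip: String)
case class Customer(name: String, age: Int, address: Address, phones: Seq[String])

object ColumnNames {
  // Lists the declared field names of a class through Java reflection.
  // Synthetic (compiler-generated) fields are skipped. The JVM does not
  // promise any particular ordering for the returned fields.
  def of(cls: Class[_]): Seq[String] =
    cls.getDeclaredFields.toSeq.filterNot(_.isSynthetic).map(_.getName)
}
```

Calling `ColumnNames.of(classOf[Customer])` should yield the four parameter names of `Customer`, and `ColumnNames.of(classOf[Address])` the two names of the nested type.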

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
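
The comment in the example above mentions custom classes that implement the Product interface as a workaround for the 22-field limit. A minimal sketch of such a class follows. `SensorReading` is a hypothetical name, and it has only three fields to keep the sketch short; a real use case would have more than 22. Whether a given Spark version reads the column names from the constructor parameters of such a class is an assumption you should verify against that version.

```scala
// Hypothetical record type that implements Product by hand instead of
// being a case class. Each constructor parameter is a public val.
class SensorReading(val id: String, val temperature: Double, val humidity: Double)
    extends Product with Serializable {

  // Required by the Equals trait that Product extends.
  override def canEqual(that: Any): Boolean = that.isInstanceOf[SensorReading]

  // Number of fields in the record.
  override def productArity: Int = 3

  // Returns the n-th field, in constructor order.
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => temperature
    case 2 => humidity
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}
```

Unlike a case class, this sketch does not define `equals`, `hashCode`, or `toString`; add them if the records are compared or printed.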

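Programmatically Specifying the Schema

When case classes cannot be defined ahead of time (for example, the structure of the records is encoded in a string, or a text dataset will be parsed and its fields projected differently for different users), a SchemaRDD can be created in three steps.

Create an RDD of Rows from the original RDD.
Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD created in step 1.
Apply the schema to the RDD of Rows via the applySchema method.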

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Spark SQL data types and Row.
import org.apache.spark.sql._

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)

// Register the SchemaRDD as a table.
peopleSchemaRDD.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
results.map(t => "Name: " + t(0)).collect().foreach(println)
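
The example above types every column as a string. When column types are also only known at run time, the same idea extends to choosing a type per column and converting each raw field accordingly. The following is a Spark-free sketch of that logic only: `ColumnType`, `Column`, `RuntimeSchema`, and the `"name:string age:int"` description format are all hypothetical and are not part of Spark's API.

```scala
// Hypothetical, Spark-free model of a schema decided at run time.
sealed trait ColumnType
case object StringColumn extends ColumnType
case object IntColumn extends ColumnType

final case class Column(name: String, columnType: ColumnType)

object RuntimeSchema {
  // Parses a description such as "name:string age:int" (an assumed format).
  def parse(description: String): Seq[Column] =
    description.split(" ").toSeq.map { spec =>
      val Array(name, typeName) = spec.split(":")
      val columnType = typeName match {
        case "string" => StringColumn
        case "int"    => IntColumn
        case other    => throw new IllegalArgumentException("unknown type: " + other)
      }
      Column(name, columnType)
    }

  // Converts one comma-separated record into values typed per the schema.
  def toRow(schema: Seq[Column], line: String): Seq[Any] = {
    val parts = line.split(",").map(_.trim).toSeq
    require(parts.length == schema.length, "record does not match schema")
    schema.zip(parts).map {
      case (Column(_, StringColumn), raw) => raw
      case (Column(_, IntColumn), raw)    => raw.toInt
    }
  }
}
```

In a Spark program, the parsed columns would be mapped to `StructField` entries with the corresponding Spark data types, and the converted values would be wrapped in a `Row`, so that the values in each row agree with the declared schema.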
