Cast from String to Integer in Spark 2

When casting from String to Integer in Spark 2 I found a bug and here I present my workaround. Later I verified the bug does not exist in Spark 3.

To reproduce the error create a file with these contents, I named it test_feature_engineering.csv:

ID,XACT_DT,DUE_DT,AFT_CHRG_XACT_AMI,BILL_AMT
1,2020-01-11,2020-01-11, 10, 10
2,2020-01-11,2020-01-12, 10, 10
3,2020-01-11 10:10:01,2020-01-12 10:20:00, 10, 10

Open the spark-shell and type the following commands:


scala> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}                                                     
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}                                                            
                                                                                                                                                
scala> import org.apache.spark.sql.{DataFrame, DataFrameReader, SQLContext}                                                                     
import org.apache.spark.sql.{DataFrame, DataFrameReader, SQLContext}  

scala> val schemaFiveColumns: StructType = StructType(List(                                                                                   
     |     StructField("ID", StringType, nullable = false),                                                                                     
     |     StructField("XACT_DT", StringType, nullable = true),                                                                                 
     |     StructField("DUE_DT", StringType, nullable = true),                                                                                  
     |     StructField("AFT_CHRG_XACT_AMI", StringType, nullable = true),                                                                       
     |     StructField("BILL_AMT", StringType, nullable = true)                                                                                 
     |   ))                                                                                                                                     
schemaFiveColumns: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,false), StructField(XACT_DT,StringType,true), St
ructField(DUE_DT,StringType,true), StructField(AFT_CHRG_XACT_AMI,StringType,true), StructField(BILL_AMT,StringType,true))                       
                                                                                                                                                
scala> def getTestDataFrame(resourceFilePath: String, schema:  StructType): DataFrame = {                                                       
     |   spark.read                                                                                                                             
     |    .schema(schema)                                                                                                                       
     |    .option("header", value=true)                                                                                                         
     |    .csv(resourceFilePath)                                                                                                                
     | }                                                                                                                                        
getTestDataFrame: (resourceFilePath: String, schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame  

scala> val df = getTestDataFrame("file:///path_to_file/test_feature_engineering.csv", schemaFiveColumns)
df: org.apache.spark.sql.DataFrame = [ID: string, XACT_DT: string ... 3 more fields]

scala> df.show()
+---+-------------------+-------------------+-----------------+--------+
| ID|            XACT_DT|             DUE_DT|AFT_CHRG_XACT_AMI|BILL_AMT|
+---+-------------------+-------------------+-----------------+--------+
|  1|         2020-01-11|         2020-01-11|               10|      10|
|  2|         2020-01-11|         2020-01-12|               10|      10|
|  3|2020-01-11 10:10:01|2020-01-12 10:20:00|               10|      10|
+---+-------------------+-------------------+-----------------+--------+


scala> val df2 = df.withColumn(colName = "res", df.col("AFT_CHRG_XACT_AMI").cast("Integer"))
df2: org.apache.spark.sql.DataFrame = [ID: string, XACT_DT: string ... 4 more fields]

scala> df2.show()
+---+-------------------+-------------------+-----------------+--------+----+
| ID|            XACT_DT|             DUE_DT|AFT_CHRG_XACT_AMI|BILL_AMT| res|
+---+-------------------+-------------------+-----------------+--------+----+
|  1|         2020-01-11|         2020-01-11|               10|      10|null|
|  2|         2020-01-11|         2020-01-12|               10|      10|null|
|  3|2020-01-11 10:10:01|2020-01-12 10:20:00|               10|      10|null|
+---+-------------------+-------------------+-----------------+--------+----+

scala> val df3 = df.withColumn(colName = "res", df.col("AFT_CHRG_XACT_AMI").cast("Float").cast("Integer"))
df3: org.apache.spark.sql.DataFrame = [ID: string, XACT_DT: string ... 4 more fields]

scala> df3.show()
+---+-------------------+-------------------+-----------------+--------+---+
| ID|            XACT_DT|             DUE_DT|AFT_CHRG_XACT_AMI|BILL_AMT|res|
+---+-------------------+-------------------+-----------------+--------+---+
|  1|         2020-01-11|         2020-01-11|               10|      10| 10|
|  2|         2020-01-11|         2020-01-12|               10|      10| 10|
|  3|2020-01-11 10:10:01|2020-01-12 10:20:00|               10|      10| 10|
+---+-------------------+-------------------+-----------------+--------+---+

Inspecting the output above it is possible to verify that a cast from String to Integer results in res column with null values in df2. df3 shows that casting first to Float and then to Integer produces the desired result.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s