In a previous post, we glimpsed briefly at creating and manipulating Spark dataframes from CSV files. In the couple of months since, Spark has already gone from version 1.3.0 to 1.5, with more than 100 built-in functions introduced in Spark 1.5 alone; so, we thought it is a good time for revisiting the subject, this time also utilizing the external …
Unexpected behavior of Spark dataframe filter method
[EDIT: Thanks to this post, the issue reported here has been resolved since Spark 1.4.1 – see the comments below] While writing the previous post on Spark dataframes, I encountered an unexpected behavior of the respective .filter method; but, on the one hand, I needed some more time to experiment and confirm it and, on the other hand, I knew …
Spark data frames from CSV files: handling headers & column types
If you come from the R (or Python/pandas) universe, like me, you must implicitly think that working with CSV files must be one of the most natural and straightforward things to happen in a data analysis context. Indeed, if you have your data in a CSV file, practically the only thing you have to do from R is to fire …