Other Common Tasks
Split Data into Training and Test Datasets
train, test = dataset.randomSplit([0.75, 0.25], seed=1337)
Rename All Columns
from pyspark.sql.functions import col

column_list = data.columns
prefix = "my_prefix"
new_column_list = [prefix + s for s in column_list]
## new_column_list = [prefix + s if s != "ID" else s for s in column_list]  ## Use this version if you plan on joining on an ID column later
column_mapping = list(zip(column_list, new_column_list))
# print(column_mapping)
data = data.select([col(old).alias(new) for old, new in column_mapping])
Convert PySpark DataFrame to NumPy array
import numpy as np

## Convert `train` DataFrame to NumPy
pdtrain = train.toPandas()
trainseries = pdtrain['features'].apply(lambda x: np.array(x.toArray())).to_numpy().reshape(-1, 1)
X_train = np.apply_along_axis(lambda x: x[0], 1, trainseries)
y_train = pdtrain['label'].values.reshape(-1, 1).ravel()

## Convert `test` DataFrame to NumPy
pdtest = test.toPandas()
testseries = pdtest['features'].apply(lambda x: np.array(x.toArray())).to_numpy().reshape(-1, 1)
X_test = np.apply_along_axis(lambda x: x[0], 1, testseries)
y_test = pdtest['label'].values.reshape(-1, 1).ravel()
print(y_test)
Call Cognitive Service API using PySpark
Create `chunker` function
The Cognitive Services APIs accept only a limited number of observations per call (1,000, to be exact) and cap the amount of data in a single request, so we create a chunker function to split the dataset into smaller batches before scoring.
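A minimal sketch of such a helper; the name chunker and the slice-based generator are just one common pattern:
## Yield successive `size`-row slices of a sequence (works on lists and pandas DataFrames)
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))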
Convert Spark DataFrame to Pandas
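The conversion itself is one call; df here stands in for whatever Spark DataFrame holds the text to score:
## Pull the distributed Spark DataFrame down to the driver as a pandas DataFrame
data_pd = df.toPandas()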
Set up API requirements
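As a sketch for the Text Analytics sentiment endpoint, this typically means an endpoint URL, a subscription key, and request headers; the URL and key below are placeholders:
import requests

## Placeholder endpoint and key -- substitute your own resource's values
api_url = "https://<your-resource>.cognitiveservices.azure.com/text/analytics/v3.0/sentiment"
subscription_key = "<your-subscription-key>"
headers = {
    "Ocp-Apim-Subscription-Key": subscription_key,
    "Content-Type": "application/json",
}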
Create DataFrame for incoming scored data
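One simple approach is an empty pandas DataFrame whose columns mirror the fields you keep from each API response; the column names here are illustrative:
import pandas as pd

## Accumulator for scored rows; column names are illustrative
scored_df = pd.DataFrame(columns=["id", "sentiment", "score"])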
Loop through chunks of the data and call the API
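A sketch of the scoring loop, assuming the chunker helper above, the placeholder names from the setup step, and id/text columns in data_pd; the response parsing matches the v3.0 sentiment payload but should be checked against the API version you call:
## Score at most 1,000 documents per request
for chunk in chunker(data_pd, 1000):
    documents = {"documents": [
        {"id": str(row["id"]), "text": row["text"]}
        for _, row in chunk.iterrows()
    ]}
    response = requests.post(api_url, headers=headers, json=documents)
    response.raise_for_status()
    for doc in response.json()["documents"]:
        scored_df.loc[len(scored_df)] = [doc["id"], doc["sentiment"], doc["confidenceScores"]["positive"]]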
Write the results out to mounted storage
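For example, converting the scored pandas DataFrame back to Spark and writing Parquet to a mount point; the path is a placeholder:
## Persist the scored results to mounted storage
scored_spark_df = spark.createDataFrame(scored_df)
scored_spark_df.write.mode("overwrite").parquet("/mnt/<your-mount>/scored_results")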
Find All Columns of a Certain Type
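A sketch using the DataFrame's dtypes attribute, here filtering a hypothetical df for string columns:
## dtypes is a list of (column_name, type_string) pairs
string_columns = [name for name, dtype in df.dtypes if dtype == "string"]
print(string_columns)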
Change a Column's Type
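A sketch with withColumn and cast; the column name and target type are illustrative:
from pyspark.sql.functions import col

## Replace the column with a copy cast to the new type
df = df.withColumn("my_column", col("my_column").cast("double"))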
Generate StructType Schema Printout (Manual Execution)
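One reading of this heading: print a copy-pasteable StructField line per column of an existing DataFrame, then tweak the pasted code by hand. The naive type-name mapping below only covers simple types such as string and double:
## Print one StructField per column for manual copy/paste editing
for name, dtype in df.dtypes:
    print(f'StructField("{name}", {dtype.capitalize()}Type(), True),')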
Generate StructType Schema from List (Automatic Execution)
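A sketch that builds the schema programmatically from a list of (name, type) pairs; the column list is illustrative:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

## Hypothetical columns and their Spark types
columns = [("name", StringType()), ("value", DoubleType())]
schema = StructType([StructField(n, t, True) for n, t in columns])
df = spark.createDataFrame([], schema)  ## e.g. an empty DataFrame with that schema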
Make a DataFrame of Consecutive Dates
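A sketch using Spark SQL's sequence and explode functions (Spark 2.4+); the date range is illustrative:
## One row per day between the two endpoint dates, inclusive
dates = spark.sql(
    "SELECT explode(sequence(to_date('2020-01-01'), to_date('2020-01-31'), interval 1 day)) AS date"
)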
Unpivot a DataFrame Dynamically (Longer)
Pivot a wide dataset into a longer form. (Similar to the pivot_longer() function from the tidyr R package or the wide_to_long() function from pandas.)
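A dynamic sketch using the SQL stack() function; it assumes a hypothetical df with an id column plus value columns that all share one type, since stack() requires the unpivoted columns to be type-compatible:
## Every column except the identifier gets unpivoted
value_cols = [c for c in df.columns if c != "id"]
stack_expr = ", ".join(f"'{c}', `{c}`" for c in value_cols)
long_df = df.selectExpr("id", f"stack({len(value_cols)}, {stack_expr}) as (key, value)")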