Other Common Tasks

Split Data into Training and Test Datasets

train, test = dataset.randomSplit([0.75, 0.25], seed = 1337)

Rename All Columns

from pyspark.sql.functions import col

column_list = data.columns
prefix = "my_prefix"
new_column_list = [prefix + s for s in column_list]
# new_column_list = [prefix + s if s != "ID" else s for s in column_list]  # Use if you plan on joining on an ID later

# Pair each old column name with its new name
column_mapping = list(zip(column_list, new_column_list))

# print(column_mapping)

# Alias every column in a single select
data = data.select([col(old).alias(new) for old, new in column_mapping])

Convert PySpark DataFrame to NumPy Array

import numpy as np

## Convert `train` DataFrame to NumPy
pdtrain = train.toPandas()
# Stack each row's feature vector into a 2-D array of shape (n_rows, n_features)
X_train = np.vstack([v.toArray() for v in pdtrain['features']])
y_train = pdtrain['label'].to_numpy().ravel()

## Convert `test` DataFrame to NumPy
pdtest = test.toPandas()
X_test = np.vstack([v.toArray() for v in pdtest['features']])
y_test = pdtest['label'].to_numpy().ravel()

print(y_test)

Call Cognitive Service API using PySpark

Create `chunker` function

The Cognitive Services APIs accept only a limited number of observations (1,000, to be exact) or a limited amount of data in a single call. To stay within those limits, we can create a chunker function that splits the dataset into smaller chunks and then call the API once per chunk.
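A minimal sketch of such a chunker, assuming the data supports `len()` and positional slicing (a Python list does, and so does a pandas DataFrame, which slices by row position). The function name and the chunk size of 1,000 follow the description above; the sample data is purely illustrative.

```python
def chunker(seq, size):
    """Yield consecutive slices of `seq`, each holding at most `size` items."""
    for pos in range(0, len(seq), size):
        yield seq[pos:pos + size]


# Illustrative data: 2,500 observations split into chunks of at most 1,000
rows = list(range(2500))
chunk_sizes = [len(chunk) for chunk in chunker(rows, 1000)]
print(chunk_sizes)  # [1000, 1000, 500]
```

Because the function is a generator, only one chunk is materialized at a time, which keeps memory use low when each chunk is sent to the API in turn.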

Convert Spark DataFrame to Pandas

Set up API requirements

Create DataFrame for incoming scored data

Loop through chunks of the data and call the API

Write the results out to mounted storage

Find All Columns of a Certain Type
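One possible approach, sketched here rather than taken from the original page, is to filter the `(name, type)` pairs that `DataFrame.dtypes` returns. The helper name `columns_of_type` and the sample pairs are hypothetical; in practice you would pass `df.dtypes` directly.

```python
def columns_of_type(dtypes, type_name):
    """Return the column names whose Spark type string equals `type_name`.

    `dtypes` is a list of (column_name, type_string) pairs, which is the
    shape returned by a PySpark DataFrame's `dtypes` attribute.
    """
    return [name for name, dtype in dtypes if dtype == type_name]


# Hypothetical stand-in for `df.dtypes`
example_dtypes = [("id", "bigint"), ("name", "string"), ("city", "string"), ("price", "double")]
string_columns = columns_of_type(example_dtypes, "string")
print(string_columns)  # ['name', 'city']
```

The resulting list can be passed straight to `df.select(string_columns)` to keep only those columns.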

Change a Column's Type

Generate StructType Schema Printout (Manual Execution)
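As a rough illustration of what a manual schema printout could look like, the sketch below builds the text of a `StructType` definition that you would then copy into your own code. The helper `schema_printout`, its input format (pairs of column name and Spark type class name), and the choice to mark every field nullable are assumptions made for this example.

```python
def schema_printout(columns):
    """Return the source text of a StructType definition.

    `columns` is a list of (column_name, type_class_name) pairs, for
    example ("id", "IntegerType"). Every field is marked nullable.
    """
    lines = ["schema = StructType(["]
    for name, type_class in columns:
        lines.append(f'    StructField("{name}", {type_class}(), True),')
    lines.append("])")
    return "\n".join(lines)


# Hypothetical column list
printout = schema_printout([("id", "IntegerType"), ("name", "StringType")])
print(printout)
```

The printed text assumes that `StructType`, `StructField`, and the named type classes are imported from `pyspark.sql.types` wherever it is pasted.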

Generate StructType Schema from List (Automatic Execution)

Make a DataFrame of Consecutive Dates
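A small sketch of one way to do this, assuming an existing SparkSession is passed in as `spark`: build the dates on the driver with the standard library, then hand them to `createDataFrame`. Both helper names and the default column name `date` are hypothetical.

```python
from datetime import date, timedelta


def consecutive_dates(start, end):
    """Return every date from `start` through `end`, inclusive."""
    n_days = (end - start).days
    return [start + timedelta(days=i) for i in range(n_days + 1)]


def consecutive_dates_df(spark, start, end, column="date"):
    """Build a one-column Spark DataFrame of consecutive dates.

    `spark` is assumed to be an existing SparkSession.
    """
    rows = [(d,) for d in consecutive_dates(start, end)]
    return spark.createDataFrame(rows, [column])


dates = consecutive_dates(date(2020, 1, 30), date(2020, 2, 2))
print(dates[0], dates[-1], len(dates))  # 2020-01-30 2020-02-02 4
```

This builds the list on the driver, which is fine for ranges of a few thousand days. For very long ranges, Spark SQL's `sequence` function combined with `explode` generates the dates inside Spark instead.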

Unpivot a DataFrame Dynamically (Longer)

Pivot a wide dataset into a longer form, similar to the pivot_longer() function from the tidyr R package or the wide_to_long() function from pandas.
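One way to sketch a dynamic unpivot is to generate a Spark SQL `stack()` expression from the column names and pass it to `selectExpr`. The helper names, and the default output column names `variable` and `value`, are assumptions for illustration. Note that `stack()` expects the value columns to share a compatible type, so cast them first if they differ.

```python
def stack_expression(value_columns, key_name="variable", value_name="value"):
    """Build a stack() expression that turns `value_columns` into key/value rows."""
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_columns)
    return f"stack({len(value_columns)}, {pairs}) as ({key_name}, {value_name})"


def unpivot(df, id_columns, key_name="variable", value_name="value"):
    """Unpivot every column of the Spark DataFrame `df` not listed in `id_columns`."""
    value_columns = [c for c in df.columns if c not in id_columns]
    return df.selectExpr(*id_columns, stack_expression(value_columns, key_name, value_name))


expr = stack_expression(["q1", "q2"])
print(expr)  # stack(2, 'q1', `q1`, 'q2', `q2`) as (variable, value)
```

Calling `unpivot(df, ["id"])` on a DataFrame with columns `id`, `q1`, and `q2` would produce one row per `id` and original column, with the column name in `variable` and its content in `value`.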
