Text Data Preparation
Tokenization and Vectorization
Load in required libraries
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.sql.functions import col, lower, regexp_replace

Remove usernames, dates, links, etc.
def clean_text(c):
    c = lower(c)
    c = regexp_replace(c, r"(https?://)\S+", "")                      # Remove links
    c = regexp_replace(c, r"\n|\r|\t", "")                            # Remove LF, CR, and tab characters
    c = regexp_replace(c, r"(?:(?:[0-9]{2}[:/,]){2}[0-9]{2,4})", "")  # Remove dates
    c = regexp_replace(c, r"@([A-Za-z0-9_]+)", "")                    # Remove usernames
    c = regexp_replace(c, r"[0-9]", "")                               # Remove numbers
    c = regexp_replace(c, '[:/#.?!&",]', "")                          # Remove symbols
    # c = regexp_replace(c, r"(@[A-Za-z0-9_]+)|([^0-9A-Za-z \t])|(\w+://\S+)", "")
    return c
dataset = dataset.withColumn("text", clean_text(col("text")))

RegEx tokenization
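A minimal sketch of this step using the RegexTokenizer imported above. The input column "text" comes from the cleaning step; the output column name "words" and the token pattern (split on runs of non-word characters) are assumptions rather than values from the original notes.

regex_tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+")
# Either transform directly, or add this stage to the pipeline built below
# tokenized = regex_tokenizer.transform(dataset)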
Remove stop words
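A sketch of the stop-word stage with StopWordsRemover (imported above). It assumes the tokenizer output column is named "words"; the output column name "filtered" and the use of the built-in English stop-word list are assumptions.

stop_words_remover = StopWordsRemover(
    inputCol="words",
    outputCol="filtered",
    stopWords=StopWordsRemover.loadDefaultStopWords("english"),  # built-in English list
)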
Count words
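A sketch of the term-count stage with CountVectorizer (imported above). The input column "filtered", the output column "features", and the vocabSize / minDF values are illustrative assumptions.

count_vectorizer = CountVectorizer(
    inputCol="filtered",
    outputCol="features",
    vocabSize=10000,  # assumed vocabulary cap
    minDF=5,          # assumed minimum document frequency
)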
Index strings
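A sketch of the label-indexing step with StringIndexer, which maps each class string to a numeric index. The raw label column name ("class") and the output column name ("label") are assumptions.

from pyspark.ml.feature import StringIndexer

label_indexer = StringIndexer(inputCol="class", outputCol="label")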
Create transformation pipeline
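A sketch that chains the stages defined above into a single Pipeline, fits it, and transforms the dataset. The stage order mirrors the sections above; the variable names are the ones assumed in the earlier sketches.

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[regex_tokenizer, stop_words_remover, count_vectorizer, label_indexer])
pipeline_model = pipeline.fit(dataset)
dataset_transformed = pipeline_model.transform(dataset)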
Extras
Get label numbers for each class
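One way to recover the class-to-number mapping produced by the StringIndexer: read the labels attribute of the fitted indexer model, where each label's position in the list is its numeric index. This assumes the indexer is the last stage of the pipeline fitted above.

label_indexer_model = pipeline_model.stages[-1]  # fitted StringIndexerModel (assumed to be the last stage)
label_mapping = {label: number for number, label in enumerate(label_indexer_model.labels)}
print(label_mapping)  # e.g. {'classA': 0, 'classB': 1, ...}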
Split text body into sentences
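A sketch using split plus explode to produce one row per sentence. The sentence-boundary regex and the output column name "sentence" are assumptions, and this assumes the split runs on text that still contains the punctuation clean_text would strip.

from pyspark.sql.functions import split, explode

sentences = dataset.withColumn(
    "sentence",
    explode(split(col("text"), r"(?<=[.!?])\s+"))  # split after ., !, or ? followed by whitespace
)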
Create `part_number` for the split sentences
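Building on the same split, posexplode returns each sentence's position alongside its value, which can serve directly as `part_number` (zero-based here; add 1 if a one-based index is wanted). Column names are assumptions.

from pyspark.sql.functions import posexplode, split

sentences = dataset.select(
    "*",
    posexplode(split(col("text"), r"(?<=[.!?])\s+")).alias("part_number", "sentence")
)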