SparkNLP is built on top of Apache Spark 2.3.0 and works with any user-provided Spark 2.x.x. It is advised to have basic knowledge of the framework and a working environment before using Spark-NLP. Refer to the Spark documentation to get started.
To start using the library, execute any of the following lines depending on your desired use case:
spark-shell --packages JohnSnowLabs:spark-nlp:1.6.1
pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
spark-submit --packages JohnSnowLabs:spark-nlp:1.6.1
Using Databricks cloud cluster or an Apache Zeppelin Scala notebook? Add the following Maven coordinates in the appropriate menu:
com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.1
For Python in Apache Zeppelin, you may need to set SPARK_SUBMIT_OPTIONS using the --packages instruction shown above, like this:
export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.1"
Python Jupyter Notebook with PySpark? Add the following environment variables (depending on your OS):
export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
Python without an explicit Spark installation? Use pip to install (after you have pip-installed pyspark):
pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.1
In this case, you will have to start the SparkSession in your Python program manually. Here is an example:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("ner")\
.master("local[*]")\
.config("spark.driver.memory","4G")\
.config("spark.driver.maxResultSize", "2G") \
.config("spark.driver.extraClassPath", "lib/spark-nlp-assembly-1.6.1.jar")\
.config("spark.kryoserializer.buffer.max", "500m")\
.getOrCreate()
A pre-compiled Spark-NLP assembly fat jar for use in standalone projects may be downloaded here. A non-fat jar may be downloaded here. Then run spark-shell or spark-submit with the appropriate --jars /path/to/spark-nlp_2.11-1.6.1.jar to use the library in Spark.
For further alternatives and documentation check out our README page in GitHub.
Spark ML provides a set of machine learning applications whose logic consists of two main components: Estimators and Transformers. An Estimator has a method called fit(), which trains on a piece of data and produces a model; a Transformer, which is generally the result of a fitting process, applies changes to the target dataset. These components have been embedded so that they are applicable to Spark-NLP. Pipelines are a mechanism for combining multiple estimators and transformers within a single workflow, allowing multiple chained transformations along a machine learning task. Refer to the Spark ML library documentation for more information.
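As a minimal sketch of this Estimator/Transformer/Pipeline pattern in plain Spark ML (the feature transformer, classifier, and column names below are just illustrative, not part of Spark-NLP itself):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF
// A Transformer: maps a column of token arrays into feature vectors
val hashingTF = new HashingTF().setInputCol("tokens").setOutputCol("features")
// An Estimator: fit() trains on data and returns a model, which is a Transformer
val logReg = new LogisticRegression().setMaxIter(10)
// A Pipeline chains both; fitting it on a training DataFrame yields a PipelineModel
val pipeline = new Pipeline().setStages(Array(hashingTF, logReg))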
An annotation is the basic form of the result of a Spark-NLP operation. Its structure holds the annotation type, the span of text it covers, the resulting output, and any associated metadata. This object is automatically generated by annotators after a transform process; no manual work is required, but it must be understood in order to use it efficiently.
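As a rough sketch of its shape (the field names below reflect the Annotation case class as we understand it in this version; treat them as an approximation):
// Approximate structure of an annotation produced by Spark-NLP annotators
case class Annotation(
  annotatorType: String,        // the kind of annotation, e.g. "token" or "pos"
  begin: Int,                   // index of the first character covered by the annotation
  end: Int,                     // index of the last character covered by the annotation
  result: String,               // the main output of the annotation
  metadata: Map[String, String] // additional information about the matched result
)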
Annotators are the spearhead of NLP functions in Spark-NLP. They come in two forms: annotator approaches, which act as Spark ML Estimators and require a training stage via fit(), and annotator models, which act as Transformers and annotate data directly.
Both forms of annotators can be included in a Pipeline; all stages will run in the provided order and transform the data accordingly. A Pipeline is turned into a PipelineModel after the fit() stage. Both the Pipeline and the PipelineModel can be saved to and re-loaded from disk at any time.
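For example, a fitted PipelineModel can be persisted and reloaded with the standard Spark ML writer and reader (the path below is illustrative):
import org.apache.spark.ml.PipelineModel
// assuming pipelineModel is the PipelineModel returned by pipeline.fit(...)
pipelineModel.write.overwrite().save("/tmp/my_nlp_pipeline")
val restoredPipeline = PipelineModel.load("/tmp/my_nlp_pipeline")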
A basic pipeline is a downloadable Spark ML pipeline ready to compute some annotations for you right away. Keep in mind it is trained with generic data and is not meant for production use, but it should give you an insight into how the library works at a glance.
Basic pipelines accept either plain strings or Spark DataFrames as input. Either way, Spark-NLP will execute the right pipeline type to bring back the results. Let's retrieve one:
import com.johnsnowlabs.nlp.pretrained.pipelines.en.BasicPipeline
val annotations = BasicPipeline().annotate(Array("We are very happy about SparkNLP", "And this is just another sentence"))
annotations.foreach(println(_, "\n"))
(Map(lemma -> List(We, be, very, happy, about, SparkNLP), document -> List(We are very happy about SparkNLP), normal -> List(We, are, very, happy, about, SparkNLP), pos -> ArrayBuffer(PRP, VBP, RB, JJ, IN, NNP), token -> List(We, are, very, happy, about, SparkNLP)))
(Map(lemma -> List(And, this, be, just, another, sentence), document -> List(And this is just another sentence), normal -> List(And, this, is, just, another, sentence), pos -> ArrayBuffer(CC, DT, VBZ, RB, DT, NN), token -> List(And, this, is, just, another, sentence)))
Since this is Apache Spark, we should make use of our DataFrame friends. Pretrained pipelines allow us to do so with the same function, just passing the target string column. The result here will be a DataFrame with many annotation columns.
import spark.implicits._
val data = Seq("hello, this is an example sentence").toDF("mainColumn")
BasicPipeline().annotate(data, "mainColumn").show()
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| text| document| token| normal| lemma| pos|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|hello, this is an...|[[document, 0, 33...|[[token, 0, 4, he...|[[token, 0, 4, he...|[[token, 0, 4, he...|[[pos, 0, 4, UH, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
To add a bit of challenge, the output of the previous DataFrame was in terms of Annotation objects. What if we want to deal with just the resulting values? We can use the Finisher annotator, retrieve the BasicPipeline, and combine them in a Spark ML Pipeline. Note that BasicPipeline expects the target column to be named "text".
import com.johnsnowlabs.nlp.Finisher
import org.apache.spark.ml.Pipeline
val finisher = new Finisher().
setInputCols("token", "normal", "lemma", "pos")
val basicPipeline = BasicPipeline().pretrained()
val pipeline = new Pipeline().
setStages(Array(
basicPipeline,
finisher
))
pipeline.
fit(spark.emptyDataFrame).
transform(data.withColumnRenamed("mainColumn", "text")).
show()
+--------------------+--------------------+--------------------+--------------------+--------------------+
| text| finished_token| finished_normal| finished_lemma| finished_pos|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|hello, this is an...|[hello, ,, this, ...|[hello, this, is,...|[hello, ,, this, ...|[UH, DT, VBZ, DT,...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
Every annotator has a type. Annotators that share a type can be used interchangeably, meaning you could use any of them when needed. For example, when a token-type annotation is required by another annotator, such as a sentiment analysis annotator, you can provide either a normalized token or a lemma, as both are of type token.
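As a sketch of what this wiring looks like (the SentimentDetector below only illustrates the column wiring and has further parameters of its own; the column names match the ones used elsewhere in this guide):
import com.johnsnowlabs.nlp.annotator._
// An annotator that requires TOKEN-type input can be fed either the Normalizer
// output ("normal") or the Lemmatizer output ("lemma"); both are of type token.
val sentimentDetector = new SentimentDetector().
  setInputCols(Array("lemma", "sentence")). // "normal" would be equally valid here
  setOutputCol("sentiment")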
Since version 1.5.0 we have made the necessary imports easy to reach: base._ includes general Spark-NLP transformers and concepts, while annotator._ includes all annotators that we currently provide. We also need the Spark ML Pipeline.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
In order to get through the NLP process, we need to get raw data annotated. There is a special transformer that does this for us: the DocumentAssembler. It creates the first annotation, of type Document, which may be used by annotators down the road.
val documentAssembler = new DocumentAssembler().
setInputCol("text").
setOutputCol("document")
In this quick example, we now proceed to identify the sentences in each of our document lines. SentenceDetector requires a Document annotation, which is provided by the DocumentAssembler output, and its own output is also of annotation type Document. The Tokenizer requires a Document annotation type, meaning it works with either the DocumentAssembler or the SentenceDetector output; here, we use the sentence output.
val sentenceDetector = new SentenceDetector().
setInputCols(Array("document")).
setOutputCol("sentence")
val regexTokenizer = new Tokenizer().
setInputCols(Array("sentence")).
setOutputCol("token")
Now we want to put all this together and retrieve the results, so we use a Pipeline. We also include another special transformer, called Finisher, to output the tokens in a human-readable format. We fit() on an empty DataFrame since none of the pipeline stages require training.
val testData = Seq("Lorem ipsum dolor sit amet, " +
"consectetur adipiscing elit, sed do eiusmod tempor " +
"incididunt ut labore et dolore magna aliqua.").toDF("text")
val finisher = new Finisher().
setInputCols("token").
setCleanAnnotations(false)
val pipeline = new Pipeline().
setStages(Array(
documentAssembler,
sentenceDetector,
regexTokenizer,
finisher
))
pipeline.
fit(Seq.empty[String].toDF("text")).
transform(testData).
show()
LightPipeline is a Spark-NLP specific Pipeline class, equivalent to Spark ML's Pipeline. The difference is that its execution does not stick to Spark's distributed principles; instead it computes everything locally (though in parallel) in order to achieve fast results when dealing with small amounts of data. This means we do not input a Spark DataFrame, but a string or an Array of strings instead, to be annotated. To create a LightPipeline, you need to pass it an already trained (fit) Spark ML PipelineModel. Its transform() stage is replaced by annotate().
import com.johnsnowlabs.nlp.base._
val trainedModel = pipeline.fit(Seq.empty[String].toDF("text"))
val lightPipeline = new LightPipeline(trainedModel)
lightPipeline.annotate("Hello world, please annotate my text")
First, either build from source or download the standalone jar module Spark-NLP-OCR (it works from both Spark-NLP Python and Scala) and add it to your Spark environment (with --jars, or with the spark.driver.extraClassPath and spark.executor.extraClassPath configuration). Second, if your PDFs don't have a text layer (this depends on how the PDFs were created), the library will use Tesseract 4.0 in the background. Tesseract relies on native libraries, so you'll need them installed on your system.
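For example, a Spark shell could be started with both the library and the OCR module on the classpath (the jar path and file name below are illustrative):
spark-shell --packages JohnSnowLabs:spark-nlp:1.6.1 --jars /path/to/spark-nlp-ocr-assembly-1.6.1.jar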
You can use OcrHelper to create Spark DataFrames directly from PDFs. This will hold entire documents in single rows, meant to be later processed by a SentenceDetector. This way, you won't be breaking the content into rows as if you were reading a standard document. The metadata column will include page number and file name information per row.
import com.johnsnowlabs.nlp.util.io.OcrHelper
val data = OcrHelper.createDataset(spark, "/pdfs/", "text", "metadata")
val documentAssembler = new DocumentAssembler().setInputCol("text").setMetadataCol("metadata")
documentAssembler.transform(data).show()
Another way would be to simply create an array of strings. This is useful, for example, if you are parsing a small number of PDF files and would like to use LightPipelines instead. See the example below.
import com.johnsnowlabs.nlp.util.io.OcrHelper
val raw = OcrHelper.createMap("/pdfs/")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val lightPipeline = new LightPipeline(new Pipeline().setStages(Array(documentAssembler, sentenceDetector)).fit(Seq.empty[String].toDF("text")))
lightPipeline.annotate(raw.values.toArray)
Training your own annotators is a key concept when dealing with real-life scenarios. The annotators provided above, such as pretrained pipelines and models, will rarely apply as-is to a specific use case; dealing with real-life problems will require training your own models. In Spark-NLP, how an annotator is trained varies from annotator to annotator; currently, three training approaches are supported.
You can check the reference documentation for the available pretrained models and for the parameters you can set in order to train your own.