
Spark NLP: State of the Art Natural Language Processing
Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 3700+ pretrained pipelines and models in more than 200 languages. It offers NLP tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Word and Sentence Embeddings, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation (180+ languages), Summarization & Question Answering, and many more.
Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, and MarianMT not only to Python and R, but also to the JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively.
Project's website
Take a look at our official Spark NLP page: http://nlp.johnsnowlabs.com/ for user documentation and examples
Community support
- Slack For live discussion with the Spark NLP community and the team
- GitHub Bug reports, feature requests, and contributions
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium Spark NLP articles
- YouTube Spark NLP video tutorials
Table of contents
Features
- Tokenization
- Trainable Word Segmentation
- Stop Words Removal
- Token Normalizer
- Document Normalizer
- Stemmer
- Lemmatizer
- NGrams
- Regex Matching
- Text Matching
- Chunking
- Date Matcher
- Sentence Detector
- Deep Sentence Detector (Deep learning)
- Dependency parsing (Labeled/unlabeled)
- Part-of-speech tagging
- Sentiment Detection (ML models)
- Spell Checker (ML and DL models)
- Word Embeddings (GloVe and Word2Vec)
- BERT Embeddings (TF Hub & HuggingFace models)
- DistilBERT Embeddings (HuggingFace models)
- RoBERTa Embeddings (HuggingFace models)
- XLM-RoBERTa Embeddings (HuggingFace models)
- Longformer Embeddings (HuggingFace models)
- ALBERT Embeddings (TF Hub & HuggingFace models)
- XLNet Embeddings
- ELMO Embeddings (TF Hub models)
- Universal Sentence Encoder (TF Hub models)
- BERT Sentence Embeddings (TF Hub & HuggingFace models)
- RoBERTa Sentence Embeddings (HuggingFace models)
- XLM-RoBERTa Sentence Embeddings (HuggingFace models)
- Sentence Embeddings
- Chunk Embeddings
- Unsupervised keywords extraction
- Language Detection & Identification (up to 375 languages)
- Multi-class Sentiment analysis (Deep learning)
- Multi-label Sentiment analysis (Deep learning)
- Multi-class Text Classification (Deep learning)
- BERT for Token Classification
- DistilBERT for Token Classification
- ALBERT for Token Classification
- RoBERTa for Token Classification
- XLM-RoBERTa for Token Classification
- XLNet for Token Classification
- Longformer for Token Classification
- Neural Machine Translation (MarianMT)
- Text-To-Text Transfer Transformer (Google T5)
- Named entity recognition (Deep learning)
- Easy TensorFlow integration
- GPU Support
- Full integration with Spark ML functions
- 2000+ pre-trained models in 200+ languages!
- 1700+ pre-trained pipelines in 200+ languages!
- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, and Urdu.
Requirements
To use Spark NLP you need the following requirements:
- Java 8
- Apache Spark 3.1.x (or 3.0.x, or 2.4.x, or 2.3.x)
NOTE: Java 11 is supported when using Spark NLP with Spark/PySpark 3.x and above
GPU (optional):
Spark NLP 3.3.0 is built with TensorFlow 2.4.1; for GPU support you also need a compatible CUDA 11 and cuDNN 8.0.2 installation.
Quick Start
This is a quick example of how to use a Spark NLP pre-trained pipeline in Python and PySpark:
In Python console or Jupyter kernel:
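A minimal sketch of such a session (assuming spark-nlp and pyspark are installed; explain_document_dl is one of the pre-trained English pipelines, and downloading it requires an internet connection):

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with Spark NLP included
spark = sparknlp.start()

# Download and load a pre-trained English pipeline
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

# Annotate plain text; returns a dict of annotation lists
result = pipeline.annotate(
    "Google has announced the release of a beta version of the popular TensorFlow machine learning library."
)
print(result["entities"])
```

Any other pipeline name from the Models Hub can be substituted for explain_document_dl.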
For more examples, you can visit our dedicated repository to showcase all Spark NLP use cases!
Apache Spark Support
Spark NLP 3.3.0 has been built on top of Apache Spark 3.x while fully supporting Apache Spark 2.3.x and Apache Spark 2.4.x:
Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x |
---|---|---|---|---|
3.3.x | YES | YES | YES | YES |
3.2.x | YES | YES | YES | YES |
3.1.x | YES | YES | YES | YES |
3.0.x | YES | YES | YES | YES |
2.7.x | YES | YES | NO | NO |
2.6.x | YES | YES | NO | NO |
2.5.x | YES | YES | NO | NO |
2.4.x | Partially | YES | NO | NO |
1.8.x | Partially | YES | NO | NO |
1.7.x | YES | NO | NO | NO |
1.6.x | YES | NO | NO | NO |
1.5.x | YES | NO | NO | NO |
NOTE: Starting with the 3.0.0 release, the default packages are based on Scala 2.12 and Apache Spark 3.x.
NOTE: Starting with the 3.0.0 release, we support all major releases of Apache Spark: 2.3.x, 2.4.x, 3.0.x, and 3.1.x.
Find out more about versions from our release notes.
Databricks Support
Spark NLP 3.3.0 has been tested and is compatible with the following runtimes:
CPU:
- 5.5 LTS
- 5.5 LTS ML
- 6.4
- 6.4 ML
- 7.3
- 7.3 ML
- 7.4
- 7.4 ML
- 7.5
- 7.5 ML
- 7.6
- 7.6 ML
- 8.0
- 8.0 ML
- 8.1
- 8.1 ML
- 8.2
- 8.2 ML
- 8.3
- 8.3 ML
- 8.4
- 8.4 ML
- 9.0
- 9.0 ML
- 9.1
- 9.1 ML
GPU:
- 8.1 ML & GPU
- 8.2 ML & GPU
- 8.3 ML & GPU
- 8.4 ML & GPU
- 9.0 ML & GPU
- 9.1 ML & GPU
NOTE: Spark NLP 3.3.0 is based on TensorFlow 2.4.x, which is compatible with CUDA 11 and cuDNN 8.0.2. The only Databricks runtimes supporting CUDA 11 are the 8.x and 9.x ML with GPU runtimes.
EMR Support
Spark NLP 3.3.0 has been tested and is compatible with the following EMR releases:
- emr-5.20.0
- emr-5.21.0
- emr-5.21.1
- emr-5.22.0
- emr-5.23.0
- emr-5.24.0
- emr-5.24.1
- emr-5.25.0
- emr-5.26.0
- emr-5.27.0
- emr-5.28.0
- emr-5.29.0
- emr-5.30.0
- emr-5.30.1
- emr-5.31.0
- emr-5.32.0
- emr-5.33.0
- emr-6.1.0
- emr-6.2.0
- emr-6.3.0
- Full list of Amazon EMR 5.x releases
- Full list of Amazon EMR 6.x releases
NOTE: EMR 6.0.0 is not supported by Spark NLP 3.3.0
Usage
Spark Packages
Command line (requires internet connection)
Spark NLP supports all major releases of Apache Spark: 2.3.x, 2.4.x, 3.0.x, and 3.1.x. That being said, you need to choose the right package for the right Apache Spark major release:
Apache Spark 3.x (3.0.x and 3.1.x - Scala 2.12)
The spark-nlp package has been published to the Maven Repository.
The spark-nlp-gpu package has been published to the Maven Repository.
Apache Spark 2.4.x (Scala 2.11)
The spark-nlp-spark24 package has been published to the Maven Repository.
The spark-nlp-gpu-spark24 package has been published to the Maven Repository.
Apache Spark 2.3.x (Scala 2.11)
The spark-nlp-spark23 package has been published to the Maven Repository.
The spark-nlp-gpu-spark23 package has been published to the Maven Repository.
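The packages above can be added at launch time with --packages (a sketch; the spark-nlp-spark24 and spark-nlp-spark23 artifact IDs are the Maven Central names for the Spark 2.4.x and 2.3.x builds):

```shell
# Apache Spark 3.x (Scala 2.12)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0

# GPU build on Apache Spark 3.x
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.0

# Apache Spark 2.4.x (Scala 2.11)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.0

# Apache Spark 2.3.x (Scala 2.11)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.0
```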
NOTE: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following set in your SparkSession:
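For instance, when launching from the command line (a sketch; the exact driver-memory value depends on your models and data):

```shell
spark-shell \
  --driver-memory 16g \
  --conf spark.kryoserializer.buffer.max=2000M \
  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0
```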
Scala
Spark NLP supports Scala 2.11.x if you are using Apache Spark 2.3.x or 2.4.x and Scala 2.12.x if you are using Apache Spark 3.0.x or 3.1.x. Our packages are deployed to Maven central. To add any of our packages as a dependency in your application you can follow these coordinates:
Maven
spark-nlp on Apache Spark 3.x:
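In a pom.xml the coordinates look like this (CPU build for Scala 2.12):

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.3.0</version>
</dependency>
```

For the GPU build or the Spark 2.4.x/2.3.x builds, swap in the matching artifactId (spark-nlp-gpu_2.12, spark-nlp-spark24_2.11, and so on).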
spark-nlp-gpu:
spark-nlp on Apache Spark 2.4.x:
spark-nlp-gpu:
spark-nlp on Apache Spark 2.3.x:
spark-nlp-gpu:
SBT
spark-nlp on Apache Spark 3.x.x:
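In build.sbt this is a single line (%% appends the Scala binary version, here 2.12):

```scala
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "3.3.0"
```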
spark-nlp-gpu:
spark-nlp on Apache Spark 2.4.x:
spark-nlp-gpu:
spark-nlp on Apache Spark 2.3.x:
spark-nlp-gpu:
Maven Central: https://mvnrepository.com/artifact/com.johnsnowlabs.nlp
If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your projects Spark NLP SBT Starter
Python
Spark NLP supports Python 3.6.x and 3.7.x if you are using PySpark 2.3.x or 2.4.x and Python 3.8.x if you are using PySpark 3.x.
Python without explicit PySpark installation
Pip/Conda
If you installed pyspark through pip/conda, you can install spark-nlp through the same channel.
Pip:
Conda:
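The install commands are as follows (a sketch; pinning the version keeps the Python package in sync with the JVM package):

```shell
# Pip
pip install spark-nlp==3.3.0

# Conda (johnsnowlabs channel)
conda install -c johnsnowlabs spark-nlp
```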
PyPI spark-nlp package / Anaconda spark-nlp package
Then you'll have to create a SparkSession either from Spark NLP:
or manually:
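A sketch of both options (sparknlp.start() wraps an equivalent builder call; the memory and serializer settings mirror the note in the Spark Packages section):

```python
# Option 1: let Spark NLP create the session
import sparknlp
spark = sparknlp.start()  # sparknlp.start(gpu=True) for the GPU package

# Option 2: create the session manually
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0") \
    .getOrCreate()
```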
If using local jars, you can use spark.jars instead, with a comma-delimited list of jar files. For cluster setups, of course, you'll have to put the jars in a location reachable by all driver and executor nodes.
Quick example:
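Since Spark NLP annotators are Spark ML stages, a custom pipeline can also be assembled directly. A minimal sketch (assuming an existing SparkSession named spark, as created above) with a DocumentAssembler feeding a Tokenizer:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

# Turn raw text into Spark NLP's internal document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split documents into tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])

data = spark.createDataFrame([["Spark NLP annotators run as Spark ML stages."]]).toDF("text")
model = pipeline.fit(data)
model.transform(data).select("token.result").show(truncate=False)
```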
Compiled JARs
Build from source
spark-nlp
- FAT-JAR for CPU on Apache Spark 3.x.x
- FAT-JAR for GPU on Apache Spark 3.x.x
- FAT-JAR for CPU on Apache Spark 2.4.x
- FAT-JAR for GPU on Apache Spark 2.4.x
- FAT-JAR for CPU on Apache Spark 2.3.x
- FAT-JAR for GPU on Apache Spark 2.3.x
Using the jar manually
If for some reason you need to use the JAR, you can either download the Fat JARs provided here or download it from Maven Central.
To add JARs to Spark programs, use the --jars option:
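For example (the assembly jar name here is a sketch; use the file you downloaded or built):

```shell
spark-shell --jars spark-nlp-assembly-3.3.0.jar
```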
The preferred way to use the library when running Spark programs is with the --packages option, as specified in the Command line section.
Apache Zeppelin
Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list
- Add a path to a pre-built jar from here in the interpreter's library list, making sure the jar is available on the driver path
Spark NLP
Text processing programming library
Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.[2][3][4] The library is built on top of Apache Spark and its Spark ML library.[5]
Its purpose is to provide an API for natural language processing pipelines that implements recent academic research results as production-grade, scalable, and trainable software. The library offers pre-trained neural network models, pipelines, and embeddings, as well as support for training custom models.[5]
Features
The design of the library makes use of the concept of a pipeline, which is an ordered set of text annotators.[6] Out-of-the-box annotators include a tokenizer, normalizer, stemmer, lemmatizer, regular expression matcher, TextMatcher, chunker, DateMatcher, SentenceDetector, DeepSentenceDetector, POS tagger, ViveknSentimentDetector, sentiment analysis, named entity recognition, conditional random field annotator, deep learning annotator, spell checking and correction, dependency parser, typed dependency parser, document classification, and language detection.[7]
The Models Hub is a platform for sharing open-source as well as licensed pretrained models and pipelines. It includes pre-trained pipelines with tokenization, lemmatization, part-of-speech tagging, and named entity recognition for more than thirteen languages; word embeddings including GloVe, ELMo, BERT, ALBERT, XLNet, Small BERT, and ELECTRA; and sentence embeddings including Universal Sentence Embeddings (USE)[8] and Language Agnostic BERT Sentence Embeddings (LaBSE).[9] It also includes resources and pre-trained models for more than two hundred languages. The Spark NLP code base includes support for East Asian languages, with tokenizers for Chinese, Japanese, and Korean; for right-to-left languages such as Urdu, Farsi, Arabic, and Hebrew; and pre-trained multilingual word and sentence embeddings such as LaBSE, as well as a translation annotator.
Usage in healthcare
Spark NLP for Healthcare is a commercial extension of Spark NLP for clinical and biomedical text mining.[10] It provides healthcare-specific annotators, pipelines, models, and embeddings for clinical entity recognition, clinical entity linking, entity normalization, assertion status detection, de-identification, relation extraction, and spell checking and correction.
The library offers access to several clinical and biomedical transformers: JSL-BERT-Clinical, BioBERT, ClinicalBERT,[11] GloVe-Med, and GloVe-ICD-O. It also includes over 50 pre-trained healthcare models that can recognize entities such as clinical findings, drugs, risk factors, anatomy, demographics, and sensitive data.
Spark OCR
Spark OCR is another commercial extension of Spark NLP for optical character recognition (OCR) from images, scanned PDF documents, and DICOM files.[7] It is a software library built on top of Apache Spark. It provides several image pre-processing features for improving text recognition results, such as adaptive thresholding and denoising, skew detection and correction, adaptive scaling, layout analysis and region detection, image cropping, and removal of background objects.
Due to the tight coupling between Spark OCR and Spark NLP, users can combine NLP and OCR pipelines for tasks such as extracting text from images, extracting data from tables, recognizing and highlighting named entities in PDF documents or masking sensitive text in order to de-identify images.[12]
Several output formats are supported by Spark OCR such as PDF, images, or DICOM files with annotated or masked entities, digital text for downstream processing in Spark NLP or other libraries, structured data formats (JSON and CSV), as files or Spark data frames.
Users can also distribute the OCR jobs across multiple nodes in a Spark cluster.
License and availability
Spark NLP is licensed under the Apache 2.0 license. The source code is publicly available on GitHub, as is documentation and a tutorial. Prebuilt versions of Spark NLP are available in PyPI and the Anaconda Repository for Python development, in Maven Central for Java & Scala development, and in Spark Packages for Spark development.
Award
In March 2019, Spark NLP received an Open Source Award for its contributions to natural language processing in Python, Java, and Scala.[13]
References
- Talby, David. "Introducing the Natural Language Processing Library for Apache Spark". databricks.com. Databricks. Retrieved 29 March 2019.
- Ellafi, Saif Addin (2018-02-28). "Comparing production-grade NLP libraries: Running Spark-NLP and spaCy pipelines". O'Reilly Media. Retrieved 2019-03-29.
- Ellafi, Saif Addin (2018-02-28). "Comparing production-grade NLP libraries: Accuracy, performance, and scalability". O'Reilly Media. Retrieved 2019-03-29.
- Ewbank, Kay. "Spark Gets NLP Library". www.i-programmer.info.
- Thomas, Alex (July 2020). Natural Language Processing with Spark NLP: Learning to Understand Text at Scale (First ed.). O'Reilly Media.
- Talby, David (2017-10-19). "Introducing the Natural Language Processing Library for Apache Spark - The Databricks Blog". Databricks. Retrieved 2019-08-27.
- Jha, Bineet Kumar; G, Sivasankari; R, Venugopal K. (May 2, 2021). "Sentiment Analysis for E-Commerce Products Using Natural Language Processing". Annals of the Romanian Society for Cell Biology: 166–175.
- Cer, Daniel; Yang, Yinfei; Kong, Sheng-yi; Hua, Nan; Limtiaco, Nicole; John, Rhomni St; Constant, Noah; Guajardo-Cespedes, Mario; Yuan, Steve; Tar, Chris; Sung, Yun-Hsuan; Strope, Brian; Kurzweil, Ray (12 April 2018). "Universal Sentence Encoder". arXiv:1803.11175 [cs.CL].
- Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (3 July 2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
- Team, Editorial (2018-09-04). "The Use of NLP to Extract Unstructured Medical Data From Text". insideBIGDATA. Retrieved 2019-08-27.
- Alsentzer, Emily; Murphy, John; Boag, William; Weng, Wei-Hung; Jindi, Di; Naumann, Tristan; McDermott, Matthew (June 2019). "Publicly Available Clinical BERT Embeddings". Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics: 72–78. arXiv:1904.03323. doi:10.18653/v1/W19-1909.
- "A Unified CV, OCR & NLP Model Pipeline for Document Understanding at DocuSign". NLP Summit. Retrieved 18 September 2020.
- https://www.oreilly.com/pub/pr/3277
External links
Spark NLP Workshop
Showcasing notebooks and codes of how to use Spark NLP in Python and Scala.
Table of contents
Python Setup
Colab setup
Main repository
https://github.com/JohnSnowLabs/spark-nlp
Project's website
Take a look at our official spark-nlp page: http://nlp.johnsnowlabs.com/ for user documentation and examples
Slack community channel
Join Slack
Contributing
If you find any example that is no longer working, please create an issue.
License
Apache License 2.0
Spark NLP: State of the Art Natural Language Processing
The most widely used NLP library in the enterprise

Source: 2020 NLP Industry Survey, by Gradient Flow.
100% Open Source
Including pre-trained models and pipelines
Natively scalable
The only NLP library built natively on Apache Spark
Multiple Languages
Full Python, Scala, and Java support
Transformers at Scale
Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, and MarianMT not only to Python and R, but also to the JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively
Right Out of The Box
Spark NLP ships with many NLP features, pre-trained models and pipelines
NLP Features
- Tokenization
- Word Segmentation
- Stop Words Removal
- Normalizer
- Stemmer
- Lemmatizer
- NGrams
- Regex Matching
- Text Matching
- Chunking
- Date Matcher
- Part-of-speech tagging
- Sentence Detector (DL models)
- Dependency parsing
- Sentiment Detection (ML models)
- Spell Checker (ML and DL models)
- Word Embeddings (GloVe and Word2Vec)
- BERT Embeddings
- DistilBERT Embeddings
- RoBERTa Embeddings
- XLM-RoBERTa Embeddings
- Longformer Embeddings
- ALBERT Embeddings
- XLNet Embeddings
- ELMO Embeddings
- Universal Sentence Encoder
- Sentence Embeddings
- Chunk Embeddings
- Neural Machine Translation (MarianMT)
- Text-To-Text Transfer Transformer (Google T5)
- Unsupervised keywords extraction
- Language Detection & Identification (up to 375 languages)
- Multi-class Text Classification (DL model)
- Multi-label Text Classification (DL model)
- Multi-class Sentiment Analysis (DL model)
- BERT for Token Classification
- DistilBERT for Token Classification
- ALBERT for Token Classification
- RoBERTa for Token Classification
- XLM-RoBERTa for Token Classification
- XLNet for Token Classification
- Longformer for Token Classification
- Named entity recognition (DL model)
- Easy TensorFlow integration
- GPU Support
- Full integration with Spark ML functions
- 2000+ pre-trained models in 200+ languages!
- 1700+ pre-trained pipelines in 200+ languages!
Benchmark
Spark NLP 3.x obtained the best-performing results in peer-reviewed academic benchmarks
Training NER
- State-of-the-art Deep Learning algorithms
- Achieve high accuracy within a few minutes
- Achieve high accuracy with a few lines of code
- Blazing fast training
- Use CPU or GPU
- 80+ Pretrained Embeddings including GloVe, Word2Vec, BERT, DistilBERT, RoBERTa, XLM-RoBERTa, Longformer, ELMO, ELECTRA, ALBERT, XLNet, BioBERT, etc.
- Multi-lingual NER models in Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, and Urdu
SYSTEM | YEAR | LANGUAGE | CONLL ‘03 |
---|---|---|---|
Spark NLP v3 | 2021 | Python/Scala/Java/R | 93.2 (test F1) 95 (dev F1) |
spaCy v3 | 2021 | Python | 91.6 |
Stanza (StanfordNLP) | 2020 | Python | 92.1 |
Flair | 2018 | Python | 93.1 |
CoreNLP | 2015 | Java | 89.6 |
SYSTEM | YEAR | LANGUAGE | ONTONOTES |
---|---|---|---|
Spark NLP v3 | 2021 | Python/Scala/Java/R | 90.0 (test F1) 92.5 (dev F1) |
spaCy RoBERTa | 2020 | Python | 89.7 (dev F1) |
Stanza (StanfordNLP) | 2020 | Python | 88.8 (dev F1) |
Flair | 2018 | Python | 89.7 |
NLP Server by John Snow Labs
Product Overview
A ready-to-use NLP Server for analyzing text documents using the NLU library. All Spark NLP pre-trained models and pipelines are easy to use via a simple and intuitive UI, without writing a line of code. For more expert users and more complex tasks, NLP Server also provides a REST API that can be used to process large amounts of data.
With NLP Server you get access to 3000+ state-of-the-art models in over 200 languages for problems like Sentiment Analysis, Spell Checking, Text Summarization, Translation, Named Entity Recognition, Question Answering, Spam Classification, and much more! All these features are available from any programming language via simple REST API calls.
Exploit the latest research developments in NLP from the Transformers world with Longformer, XLM-RoBERTa, XLING, ELECTRA, LaBSE, DistilBERT, XLNet, USE, T5, Marian, and much more!
Operating System
Linux/Unix, Ubuntu 20.04
Highlights
- Spark NLP text annotation