First, let’s install and import NLTK and download the necessary resources.
! pip install nltk
Requirement already satisfied: nltk in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (3.9.1)
Requirement already satisfied: click in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (8.1.8)
Requirement already satisfied: joblib in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (1.4.2)
Requirement already satisfied: regex>=2021.8.3 in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (2024.11.6)
Requirement already satisfied: tqdm in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (4.67.1)
Requirement already satisfied: colorama in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from click->nltk) (0.4.6)
import nltk
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
True
Sample Text Data
Let’s create a sample corpus to work with throughout this notebook.
corpus ="""This is a sample text corpus.It contains multiple sentences.The purpose of this corpus is to demonstrate text processing."""
# Display the corpusprint("Sample corpus:")corpus
Sample corpus:
'This is a sample text corpus.\nIt contains multiple sentences.\nThe purpose of this corpus is to demonstrate text processing.\n'
Tokenization
Tokenization is the process of breaking down text into smaller units, such as sentences or words. This is typically the first step in any NLP pipeline.
Sentence Tokenization
Sentence tokenization splits a paragraph or document into individual sentences.
from nltk.tokenize import sent_tokenize

# Tokenize corpus into sentences
sentences = sent_tokenize(corpus)
print(f"Number of sentences: {len(sentences)}")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
Number of sentences: 3
Sentence 1: This is a sample text corpus.
Sentence 2: It contains multiple sentences.
Sentence 3: The purpose of this corpus is to demonstrate text processing.
# Check type of document
print(f"Type of tokenized sentences: {type(sentences)}")
Type of tokenized sentences: <class 'list'>
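Note that sent_tokenize defaults to English, but NLTK ships pre-trained Punkt models for several other languages, selected via the language parameter. A minimal sketch (the German sample sentence is purely illustrative):

# sent_tokenize accepts a language argument backed by pre-trained Punkt models
german_sentences = sent_tokenize("Hallo Welt. Wie geht es dir?", language='german')
print(german_sentences)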
Word Tokenization
Word tokenization splits sentences into individual words. NLTK offers several tokenizers with different behaviors.
# Word tokenization (sentence → words)
from nltk.tokenize import word_tokenize

# Tokenize each sentence into words
print("Word tokenization results:")
for i, sentence in enumerate(sentences, 1):
    words = word_tokenize(sentence)
    print(f"Sentence {i}: {words}")
NLTK provides various tokenizers, each with different rules and behaviors. Let’s compare them:
from nltk.tokenize import wordpunct_tokenize, TreebankWordTokenizer
import pandas as pd

# Sample text for comparison
sample = "Don't hesitate to email me at john.doe@example.com or call at 555-123-4567!"

# Compare different tokenizers
tokenizers = {
    'word_tokenize': word_tokenize,
    'wordpunct_tokenize': wordpunct_tokenize,
    'TreebankWordTokenizer': TreebankWordTokenizer().tokenize
}

# Create a DataFrame to display results
results = {}
max_length = 0

# Tokenize and find the maximum length of tokenized results
for name, tokenizer in tokenizers.items():
    tokenized = tokenizer(sample)
    results[name] = tokenized
    max_length = max(max_length, len(tokenized))

# Pad tokenized results to make all arrays the same length
for name in results:
    results[name] += [None] * (max_length - len(results[name]))

# Display results as a DataFrame
pd.DataFrame(results).T
                         0    1         2         3      4      5   6         7     8            9  ...   14    15    16    17    18    19    20    21    22    23
word_tokenize           Do  n't  hesitate        to  email     me  at  john.doe     @  example.com  ...    !  None  None  None  None  None  None  None  None  None
wordpunct_tokenize     Don    '         t  hesitate     to  email  me        at  john            .  ...  com    or  call    at   555     -   123     -  4567     !
TreebankWordTokenizer   Do  n't  hesitate        to  email     me  at  john.doe     @  example.com  ...    !  None  None  None  None  None  None  None  None  None

3 rows × 24 columns
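If none of the built-in tokenizers fits your data, NLTK's RegexpTokenizer lets you define the token pattern yourself. As a sketch, the pattern below (one of many possible choices) keeps e-mail addresses and phone numbers intact:

from nltk.tokenize import RegexpTokenizer

# Custom pattern: e-mail addresses, phone numbers, words (with optional apostrophe), punctuation
custom_tokenizer = RegexpTokenizer(r"[\w.+-]+@[\w-]+\.[\w.]+|\d{3}-\d{3}-\d{4}|\w+'?\w*|[^\w\s]")
print(custom_tokenizer.tokenize(sample))
# This should keep 'john.doe@example.com' and '555-123-4567' as single tokens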
Stemming
Stemming is the process of reducing words to their word stem or root form. It’s a rule-based process that chops off the ends of words to remove affixes. Stemming is useful for text normalization, but often produces non-dictionary words.
Comparing Different Stemmers
from nltk.stem import PorterStemmer, RegexpStemmer, SnowballStemmer

# Create sample words to compare stemmers
words = ["fearly", "running", "ran", "easily", "fairness", "eating", "eats", "eater",
         "eat", "history", "historical", "congratulations", "sliding", "comfortable"]
import pandas as pd

# Initialize stemmers
porter_stemmer = PorterStemmer()
regexp_stemmer = RegexpStemmer('ing$|s$|e$|able', min=4)
snowball_stemmer = SnowballStemmer(language="english")

# Create comparison table
stemming_results = {
    'Original': words,
    'Porter': [porter_stemmer.stem(word) for word in words],
    'RegExp': [regexp_stemmer.stem(word) for word in words],
    'Snowball': [snowball_stemmer.stem(word) for word in words]
}

# Display results
stemming_df = pd.DataFrame(stemming_results)
stemming_df
           Original     Porter          RegExp   Snowball
0            fearly     fearli          fearly       fear
1           running        run            runn        run
2               ran        ran             ran        ran
3            easily     easili          easily     easili
4          fairness       fair         fairnes       fair
5            eating        eat             eat        eat
6              eats        eat             eat        eat
7             eater      eater           eater      eater
8               eat        eat             eat        eat
9           history    histori         history    histori
10       historical     histor      historical     histor
11  congratulations  congratul  congratulation  congratul
12          sliding      slide            slid      slide
13      comfortable    comfort         comfort    comfort
Stemming Analysis
As you can see from the results:
Porter Stemmer: One of the oldest and simplest stemmers, it applies a set of rules to remove suffixes.
RegExp Stemmer: Uses regular expressions to strip specified patterns from the end of words. It’s simple but less comprehensive.
Snowball Stemmer: An improved version of the Porter algorithm, also known as Porter2, offering better accuracy for English and support for multiple languages.
Notice how stemming can sometimes produce non-dictionary words (e.g., “histori” for “history”). This is one of the main drawbacks of stemming compared to lemmatization.
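Since the Snowball stemmer supports languages beyond English, you can list the bundled languages and instantiate a stemmer for another one. A quick sketch (the German example word is illustrative; the output depends on Snowball's rules for that language):

# Languages bundled with the Snowball stemmer
print(SnowballStemmer.languages)

# Example: stemming a German word
german_stemmer = SnowballStemmer(language="german")
print(german_stemmer.stem("laufen"))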
Lemmatization
Lemmatization is similar to stemming, but it reduces words to their dictionary form (lemma) rather than just chopping off affixes. It considers the morphological analysis of the words and produces actual dictionary words.
Lemmatization is often preferred for applications like chatbots, Q&A systems, and text summarization because it preserves the meaning of words.
from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()
import pandas as pd

# Lemmatization with different POS (Part-of-Speech) tags
# POS tags: n-noun, v-verb, a-adjective, r-adverb
pos_tags = {'n': 'noun', 'v': 'verb', 'a': 'adjective', 'r': 'adverb'}

# Test words for lemmatization
lemma_words = ["running", "ran", "better", "studies", "studied", "feet", "children",
               "geese", "mice", "are", "is", "was", "fairly"]

# Create comparison table for different POS tags
lemma_results = {'Original': lemma_words}
for pos_tag, name in pos_tags.items():
    lemma_results[f'Lemma ({name})'] = [lemmatizer.lemmatize(word, pos=pos_tag) for word in lemma_words]

# Display results
pd.DataFrame(lemma_results)
    Original Lemma (noun) Lemma (verb) Lemma (adjective) Lemma (adverb)
0    running      running          run           running        running
1        ran          ran          run               ran            ran
2     better       better       better              good           well
3    studies        study        study           studies        studies
4    studied      studied        study           studied        studied
5       feet         foot         feet              feet           feet
6   children        child     children          children       children
7      geese        goose        geese             geese          geese
8       mice        mouse         mice              mice           mice
9        are          are           be               are            are
10        is           is           be                is             is
11       was           wa           be               was            was
12    fairly       fairly       fairly            fairly         fairly
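As the table shows, the lemma depends heavily on the POS tag you supply. In practice you rarely know the POS in advance, so a common approach is to derive the WordNet POS from nltk.pos_tag output. A minimal sketch, assuming the usual convention that the first letter of a Penn Treebank tag determines the WordNet POS:

from nltk.corpus import wordnet

def wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the corresponding WordNet POS constant
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # lemmatize() defaults to noun anyway

tagged = nltk.pos_tag(nltk.word_tokenize("The children were running home"))
print([lemmatizer.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged])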
Stemming vs. Lemmatization
Let’s compare stemming and lemmatization side by side to see the differences:
# Compare stemming vs lemmatization
compare_words = ["running", "better", "studies", "feet", "wolves", "are", "historically"]

comparison = {
    'Original': compare_words,
    'Porter Stemmer': [porter_stemmer.stem(word) for word in compare_words],
    'Snowball Stemmer': [snowball_stemmer.stem(word) for word in compare_words],
    'Lemmatization (verb)': [lemmatizer.lemmatize(word, pos='v') for word in compare_words],
    'Lemmatization (noun)': [lemmatizer.lemmatize(word, pos='n') for word in compare_words]
}
pd.DataFrame(comparison)
       Original Porter Stemmer Snowball Stemmer Lemmatization (verb) Lemmatization (noun)
0       running            run              run                  run              running
1        better         better           better               better               better
2       studies          studi            studi                study                study
3          feet           feet             feet                 feet                 foot
4        wolves           wolv             wolv               wolves                 wolf
5           are            are              are                   be                  are
6  historically         histor           histor         historically         historically
Stopword Removal
Stopwords are common words like “the”, “a”, “an”, “in” that usually don’t carry much meaning in text analysis. Removing them can help reduce noise in text processing.
# Sample paragraph for stopword removal
paragraph = """On July 16, 1969, the Apollo 11 spacecraft launched from the Kennedy Space Center
in Florida. Its mission was to go where no human being had gone before—the moon! The crew consisted
of Neil Armstrong, Michael Collins, and Buzz Aldrin. The spacecraft landed on the moon in the Sea of
Tranquility, a basaltic flood plain, on July 20, 1969. The moonwalk took place the following day. On
July 21, 1969, at precisely 10:56 EDT, Commander Neil Armstrong emerged from the Lunar Module and
took his famous first step onto the moon's surface. He declared it was a monumental moment in human
history!"""
from nltk.corpus import stopwords

# Get English stopwords
stop_words = stopwords.words('english')

# Display first 20 stopwords
print(f"Total English stopwords: {len(stop_words)}")
print(f"Sample stopwords: {stop_words[:20]}")
# Process text with and without stopword removal

# Tokenize the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)

# Initialize lists for processed sentences
with_stopwords = []
without_stopwords = []

# Process each sentence
for sentence in sentences[:3]:  # Process first 3 sentences for brevity
    words = nltk.word_tokenize(sentence)

    # Keep all words
    with_stopwords.append(' '.join(words))

    # Remove stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    without_stopwords.append(' '.join(filtered_words))

# Create DataFrame for comparison
stopword_df = pd.DataFrame({
    'Original with Stopwords': with_stopwords,
    'After Stopwords Removal': without_stopwords
})
stopword_df
                             Original with Stopwords                             After Stopwords Removal
0  On July 16 , 1969 , the Apollo 11 spacecraft l...  July 16 , 1969 , Apollo 11 spacecraft launched...
1  Its mission was to go where no human being had...            mission go human gone before—the moon !
2  The crew consisted of Neil Armstrong , Michael...  crew consisted Neil Armstrong , Michael Collin...
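Because the built-in stopword list is a plain Python list, it is easy to customize for a specific task. A small sketch (the words added and removed below are purely illustrative):

# Extend or shrink the stopword list for your own task
custom_stops = set(stopwords.words('english'))
custom_stops.update({'spacecraft', 'mission'})   # hypothetical domain-specific stopwords
custom_stops -= {'no', 'not'}                    # keep negations, e.g. for sentiment-style tasks
print(len(custom_stops))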
Complete Text Processing Pipeline
Let’s put everything together to create a complete text processing pipeline that includes tokenization, stopword removal, and either stemming or lemmatization.
def process_text(text, use_stemming=True, use_lemmatization=False):
    """Process text using a complete NLP pipeline

    Args:
        text (str): Input text to process
        use_stemming (bool): Whether to apply stemming
        use_lemmatization (bool): Whether to apply lemmatization

    Returns:
        list: List of processed sentences
    """
    # Build the stopword set once instead of on every iteration
    stop_words = set(stopwords.words('english'))

    # Tokenize into sentences
    sentences = nltk.sent_tokenize(text)
    processed_sentences = []

    for sentence in sentences:
        # Tokenize into words
        words = nltk.word_tokenize(sentence)

        # Remove stopwords and punctuation
        filtered_words = [word.lower() for word in words
                          if word.lower() not in stop_words and word.isalnum()]

        # Apply stemming or lemmatization
        if use_stemming:
            processed_words = [snowball_stemmer.stem(word) for word in filtered_words]
        elif use_lemmatization:
            processed_words = [lemmatizer.lemmatize(word, pos='v') for word in filtered_words]
        else:
            processed_words = filtered_words

        processed_sentences.append(' '.join(processed_words))

    return processed_sentences


# Process the paragraph
stemmed_text = process_text(paragraph, use_stemming=True, use_lemmatization=False)
lemmatized_text = process_text(paragraph, use_stemming=False, use_lemmatization=True)

# Display first 3 processed sentences
for i, (stem, lemma) in enumerate(zip(stemmed_text[:3], lemmatized_text[:3])):
    print(f"Sentence {i+1}:")
    print(f"  Stemmed: {stem}")
    print(f"  Lemmatized: {lemma}")
    print()
Sentence 1:
Stemmed: juli 16 1969 apollo 11 spacecraft launch kennedi space center florida
Lemmatized: july 16 1969 apollo 11 spacecraft launch kennedy space center florida
Sentence 2:
Stemmed: mission go human gone moon
Lemmatized: mission go human go moon
Sentence 3:
Stemmed: crew consist neil armstrong michael collin buzz aldrin
Lemmatized: crew consist neil armstrong michael collins buzz aldrin
Part-of-Speech (POS) Tagging
POS tagging is the process of marking words in a text with their corresponding part of speech (noun, verb, adjective, etc.). It’s an essential step for many NLP applications.
# Common POS tags in NLTK
pos_tags_info = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal digit',
    'DT': 'Determiner',
    'EX': 'Existential there ("there is")',
    'FW': 'Foreign word',
    'IN': 'Preposition/subordinating conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, comparative ("bigger")',
    'JJS': 'Adjective, superlative ("biggest")',
    'LS': 'List marker',
    'MD': 'Modal (could, will)',
    'NN': 'Noun, singular',
    'NNS': 'Noun plural',
    'NNP': 'Proper noun, singular',
    'NNPS': 'Proper noun, plural',
    'PDT': 'Predeterminer',
    'POS': 'Possessive ending',
    'PRP': 'Personal pronoun (I, he, she)',
    'PRP$': 'Possessive pronoun (my, his, hers)',
    'RB': 'Adverb',
    'RBR': 'Adverb, comparative',
    'RBS': 'Adverb, superlative',
    'RP': 'Particle',
    'TO': 'to',
    'UH': 'Interjection',
    'VB': 'Verb, base form',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund/present participle',
    'VBN': 'Verb, past participle',
    'VBP': 'Verb, sing. present, non-3d',
    'VBZ': 'Verb, 3rd person sing. present',
    'WDT': 'Wh-determiner (which)',
    'WP': 'Wh-pronoun (who, what)',
    'WP$': 'Possessive wh-pronoun (whose)',
    'WRB': 'Wh-adverb (where, when)'
}

# Display POS tag information as a table
pos_df = pd.DataFrame([(tag, desc) for tag, desc in pos_tags_info.items()],
                      columns=['Tag', 'Description'])
pos_df
     Tag                            Description
0     CC               Coordinating conjunction
1     CD                         Cardinal digit
2     DT                             Determiner
3     EX         Existential there ("there is")
4     FW                           Foreign word
5     IN  Preposition/subordinating conjunction
6     JJ                              Adjective
7    JJR      Adjective, comparative ("bigger")
8    JJS     Adjective, superlative ("biggest")
9     LS                            List marker
10    MD                    Modal (could, will)
11    NN                         Noun, singular
12   NNS                            Noun plural
13   NNP                  Proper noun, singular
14  NNPS                    Proper noun, plural
15   PDT                          Predeterminer
16   POS                      Possessive ending
17   PRP          Personal pronoun (I, he, she)
18  PRP$     Possessive pronoun (my, his, hers)
19    RB                                 Adverb
20   RBR                    Adverb, comparative
21   RBS                    Adverb, superlative
22    RP                               Particle
23    TO                                     to
24    UH                           Interjection
25    VB                        Verb, base form
26   VBD                       Verb, past tense
27   VBG        Verb, gerund/present participle
28   VBN                  Verb, past participle
29   VBP            Verb, sing. present, non-3d
30   VBZ         Verb, 3rd person sing. present
31   WDT                  Wh-determiner (which)
32    WP                 Wh-pronoun (who, what)
33   WP$          Possessive wh-pronoun (whose)
34   WRB                Wh-adverb (where, when)
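For reference, NLTK can print the same tag documentation itself via nltk.help.upenn_tagset. A quick sketch, assuming the 'tagsets' data package can be downloaded in your environment:

# Look up a tag's official description and usage examples
nltk.download('tagsets')
nltk.help.upenn_tagset('VBZ')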
# Example sentences for POS tagging
example_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I am studying natural language processing.",
    "She walked to the store, but it was closed."
]

# Perform POS tagging
for i, sentence in enumerate(example_sentences, 1):
    # Tokenize and tag words
    words = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(words)

    # Create a visualization of the tagged sentence
    print(f"Sentence {i}: {sentence}")

    # Display tagged words in a table format
    tagged_df = pd.DataFrame(tagged, columns=['Word', 'POS Tag'])
    tagged_df['Description'] = tagged_df['POS Tag'].map(lambda tag: pos_tags_info.get(tag, 'Unknown'))
    display(tagged_df)
    print("\n")
Sentence 1: The quick brown fox jumps over the lazy dog.
    Word POS Tag                            Description
0    The      DT                             Determiner
1  quick      JJ                              Adjective
2  brown      NN                         Noun, singular
3    fox      NN                         Noun, singular
4  jumps     VBZ         Verb, 3rd person sing. present
5   over      IN  Preposition/subordinating conjunction
6    the      DT                             Determiner
7   lazy      JJ                              Adjective
8    dog      NN                         Noun, singular
9      .       .                                Unknown
Sentence 2: I am studying natural language processing.
         Word POS Tag                      Description
0           I     PRP    Personal pronoun (I, he, she)
1          am     VBP      Verb, sing. present, non-3d
2    studying     VBG  Verb, gerund/present participle
3     natural      JJ                        Adjective
4    language      NN                   Noun, singular
5  processing      NN                   Noun, singular
6           .       .                          Unknown
Sentence 3: She walked to the store, but it was closed.
      Word POS Tag                    Description
0      She     PRP  Personal pronoun (I, he, she)
1   walked     VBD               Verb, past tense
2       to      TO                             to
3      the      DT                     Determiner
4    store      NN                 Noun, singular
5        ,       ,                        Unknown
6      but      CC       Coordinating conjunction
7       it     PRP  Personal pronoun (I, he, she)
8      was     VBD               Verb, past tense
9   closed     VBN          Verb, past participle
10       .       .                        Unknown
Named Entity Recognition (NER)
Named Entity Recognition is the process of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, time expressions, quantities, etc.
# Example sentences for NER
ner_examples = [
    "The Eiffel Tower stands on four lattice-girder piers that taper inward and join to form a single large vertical tower.",
    "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.",
    "Barack Obama was born in Hawaii and served as the 44th president of the United States from 2009 to 2017."
]
# Process examples with NER
for i, example in enumerate(ner_examples, 1):
    # Tokenize and tag
    words = nltk.word_tokenize(example)
    pos_tags = nltk.pos_tag(words)

    # Apply NER
    ner_tree = nltk.ne_chunk(pos_tags)

    print(f"Example {i}: {example}")
    print("\nNamed Entities:")

    # Extract and print named entities
    named_entities = []
    for chunk in ner_tree:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join(c[0] for c in chunk)
            entity_type = chunk.label()
            named_entities.append((entity_name, entity_type))

    if named_entities:
        entities_df = pd.DataFrame(named_entities, columns=['Entity', 'Type'])
        display(entities_df)
    else:
        print("No named entities found")
    print("\n")
Example 1: The Eiffel Tower stands on four lattice-girder piers that taper inward and join to form a single large vertical tower.
Named Entities:
         Entity          Type
0  Eiffel Tower  ORGANIZATION
Example 2: Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.

Named Entities:
          Entity          Type
0          Apple        PERSON
1           Inc.  ORGANIZATION
2     Steve Jobs        PERSON
3  Steve Wozniak        PERSON
4   Ronald Wayne        PERSON
Example 3: Barack Obama was born in Hawaii and served as the 44th president of the United States from 2009 to 2017.

Named Entities:
          Entity    Type
0         Barack  PERSON
1          Obama  PERSON
2         Hawaii     GPE
3  United States     GPE
# Visualize NER Tree (if svgling is installed)
# Use the second example for visualization; build the tree first so the
# fallback branch below can still print it if svgling is missing
example = ner_examples[1]
words = nltk.word_tokenize(example)
pos_tags = nltk.pos_tag(words)
ner_tree = nltk.ne_chunk(pos_tags)

try:
    import svgling
    print(f"Named Entity Tree for: {example}")
    svgling.draw_tree(ner_tree)
except ImportError:
    print("To visualize NER trees, install the 'svgling' package using: pip install svgling")
    # Alternative visualization
    print(ner_tree)
Named Entity Tree for: Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.
Text Analysis Dashboard
Let’s combine the techniques we’ve learned into a single, comprehensive analysis of a text.
def analyze_text(text):
    """Comprehensive text analysis using NLTK"""
    from collections import Counter

    # Basic statistics
    sentences = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    words_lower = [word.lower() for word in words if word.isalnum()]
    stop_words = set(stopwords.words('english'))
    words_no_stop = [word for word in words_lower if word not in stop_words]

    # Word frequency
    word_freq = Counter(words_no_stop)
    common_words = word_freq.most_common(10)

    # POS distribution
    pos_tags = nltk.pos_tag(words_lower)
    pos_counts = Counter([tag for _, tag in pos_tags])

    # Named entities
    ner_tree = nltk.ne_chunk(nltk.pos_tag(words))
    named_entities = []
    for chunk in ner_tree:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join(c[0] for c in chunk)
            entity_type = chunk.label()
            named_entities.append((entity_name, entity_type))

    # Print results
    print("=== TEXT ANALYSIS DASHBOARD ===")
    print(f"Text length: {len(text)} characters")
    print(f"Sentences: {len(sentences)}")
    print(f"Words: {len(words_lower)}")
    print(f"Unique words: {len(set(words_lower))}")
    print(f"Words without stopwords: {len(words_no_stop)}")

    print("\n=== MOST COMMON WORDS ===")
    for word, count in common_words:
        print(f"{word}: {count}")

    print("\n=== PART OF SPEECH DISTRIBUTION ===")
    for pos, count in pos_counts.most_common(5):
        print(f"{pos} ({pos_tags_info.get(pos, 'Unknown')}): {count}")

    print("\n=== NAMED ENTITIES ===")
    if named_entities:
        entities_df = pd.DataFrame(named_entities, columns=['Entity', 'Type'])
        display(entities_df)
    else:
        print("No named entities found")

    # Generate word cloud if the wordcloud package is available
    try:
        from wordcloud import WordCloud
        wordcloud = WordCloud(width=800, height=400,
                              background_color='white').generate(' '.join(words_no_stop))
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title('Word Cloud')
        plt.show()
    except ImportError:
        print("\nInstall wordcloud package for word cloud visualization: pip install wordcloud")


# Run analysis on the Apollo 11 paragraph
analyze_text(paragraph)
=== TEXT ANALYSIS DASHBOARD ===
Text length: 593 characters
Sentences: 7
Words: 100
Unique words: 70
Words without stopwords: 63
=== MOST COMMON WORDS ===
july: 3
1969: 3
moon: 3
spacecraft: 2
human: 2
neil: 2
armstrong: 2
took: 2
16: 1
apollo: 1
=== PART OF SPEECH DISTRIBUTION ===
NN (Noun, singular): 32
IN (Preposition/subordinating conjunction): 14
DT (Determiner): 13
VBD (Verb, past tense): 9
JJ (Adjective): 8
=== NAMED ENTITIES ===
                     Entity          Type
0      Kennedy Space Center      FACILITY
1                   Florida           GPE
2            Neil Armstrong        PERSON
3           Michael Collins        PERSON
4               Buzz Aldrin        PERSON
5                       Sea  ORGANIZATION
6               Tranquility           GPE
7  Commander Neil Armstrong  ORGANIZATION
8              Lunar Module  ORGANIZATION
Install wordcloud package for word cloud visualization: pip install wordcloud