Natural Language Processing with NLTK

Setup

First, let’s install and import NLTK and download the necessary resources.

! pip install nltk
Requirement already satisfied: nltk in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (3.9.1)
Requirement already satisfied: click in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (8.1.8)
Requirement already satisfied: joblib in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (1.4.2)
Requirement already satisfied: regex>=2021.8.3 in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (2024.11.6)
Requirement already satisfied: tqdm in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (4.67.1)
Requirement already satisfied: colorama in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from click->nltk) (0.4.6)

import nltk
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
True
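
Note: recent NLTK releases (3.8.2 and later) load some components from table-based resources with slightly different names. If tokenization, tagging, or chunking later raises a "Resource ... not found" error, the hedged sketch below downloads those newer variants as well; otherwise it can be skipped.

# Optional: newer NLTK versions may ask for "_tab"/"_eng" variants of some resources.
# Only needed if you later hit "Resource ... not found" errors.
for resource in ['punkt_tab', 'averaged_perceptron_tagger_eng', 'maxent_ne_chunker_tab']:
    nltk.download(resource)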

Sample Text Data

Let’s create a sample corpus to work with throughout this notebook.

corpus = """This is a sample text corpus.
It contains multiple sentences.
The purpose of this corpus is to demonstrate text processing.
"""
# Display the corpus
print("Sample corpus:")
corpus
Sample corpus:
'This is a sample text corpus.\nIt contains multiple sentences.\nThe purpose of this corpus is to demonstrate text processing.\n'

Tokenization

Tokenization is the process of breaking down text into smaller units, such as sentences or words. This is typically the first step in any NLP pipeline.

Sentence Tokenization

Sentence tokenization splits a paragraph or document into individual sentences.

from nltk.tokenize import sent_tokenize

# Tokenize corpus into sentences
sentences = sent_tokenize(corpus)

print(f"Number of sentences: {len(sentences)}")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
Number of sentences: 3
Sentence 1: This is a sample text corpus.
Sentence 2: It contains multiple sentences.
Sentence 3: The purpose of this corpus is to demonstrate text processing.
# Check the type of the tokenized result
print(f"Type of tokenized sentences: {type(sentences)}")
Type of tokenized sentences: <class 'list'>
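
Under the hood, sent_tokenize uses the pre-trained Punkt model, which is built to avoid splitting on common abbreviations. A minimal sketch with a made-up example sentence:

# Punkt is trained to recognize abbreviations such as "Dr." and "p.m."
abbrev_text = "Dr. Smith arrived at 5 p.m. on Monday. He left the next morning."
for s in sent_tokenize(abbrev_text):
    print(s)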

Word Tokenization

Word tokenization splits sentences into individual words. NLTK offers several tokenizers with different behaviors.

# Word tokenization (sentence → words)
from nltk.tokenize import word_tokenize

# Tokenize each sentence into words
print("Word tokenization results:")
for i, sentence in enumerate(sentences, 1):
    words = word_tokenize(sentence)
    print(f"Sentence {i}: {words}")
Word tokenization results:
Sentence 1: ['This', 'is', 'a', 'sample', 'text', 'corpus', '.']
Sentence 2: ['It', 'contains', 'multiple', 'sentences', '.']
Sentence 3: ['The', 'purpose', 'of', 'this', 'corpus', 'is', 'to', 'demonstrate', 'text', 'processing', '.']

Comparing Different Tokenizers

NLTK provides various tokenizers, each with different rules and behaviors. Let’s compare them:

from nltk.tokenize import wordpunct_tokenize, TreebankWordTokenizer
# Sample text for comparison
sample = "Don't hesitate to email me at john.doe@example.com or call at 555-123-4567!"

# Compare different tokenizers
tokenizers = {
    'word_tokenize': word_tokenize,
    'wordpunct_tokenize': wordpunct_tokenize,
    'TreebankWordTokenizer': TreebankWordTokenizer().tokenize
}

# Create a DataFrame to display results
results = {}
max_length = 0

# Tokenize and find the maximum length of tokenized results
for name, tokenizer in tokenizers.items():
    tokenized = tokenizer(sample)
    results[name] = tokenized
    max_length = max(max_length, len(tokenized))

# Pad tokenized results to make all arrays the same length
for name in results:
    results[name] += [None] * (max_length - len(results[name]))

# Display results as a DataFrame
pd.DataFrame(results).T
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
word_tokenize Do n't hesitate to email me at john.doe @ example.com ... ! None None None None None None None None None
wordpunct_tokenize Don ' t hesitate to email me at john . ... com or call at 555 - 123 - 4567 !
TreebankWordTokenizer Do n't hesitate to email me at john.doe @ example.com ... ! None None None None None None None None None

3 rows × 24 columns
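
Beyond these three, NLTK ships further tokenizers for special cases. As a rough sketch (the regular expression below is purely illustrative), RegexpTokenizer lets you define your own token pattern and TweetTokenizer is tuned for informal text:

from nltk.tokenize import RegexpTokenizer, TweetTokenizer

# Treat every run of non-whitespace characters as a single token,
# so the email address stays in one piece
regexp_tok = RegexpTokenizer(r'\S+')
print(regexp_tok.tokenize(sample))

# TweetTokenizer is designed for informal text (contractions, emoticons, handles, ...)
tweet_tok = TweetTokenizer()
print(tweet_tok.tokenize(sample))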

Stemming

Stemming is the process of reducing words to their word stem or root form. It’s a rule-based process that chops off the ends of words to remove affixes. Stemming is useful for text normalization, but often produces non-dictionary words.

Comparing Different Stemmers

from nltk.stem import PorterStemmer, RegexpStemmer, SnowballStemmer

# Create sample words to compare stemmers
words = ["fearly", "running", "ran", "easily", "fairness", "eating", "eats", "eater", "eat", "history", "historical", "congratulations", "sliding", "comfortable"]
# Initialize stemmers
porter_stemmer = PorterStemmer()
regexp_stemmer = RegexpStemmer('ing$|s$|e$|able', min=4)
snowball_stemmer = SnowballStemmer(language="english")

# Create comparison table
stemming_results = {
    'Original': words,
    'Porter': [porter_stemmer.stem(word) for word in words],
    'RegExp': [regexp_stemmer.stem(word) for word in words],
    'Snowball': [snowball_stemmer.stem(word) for word in words]
}

# Display results
stemming_df = pd.DataFrame(stemming_results)
stemming_df
Original Porter RegExp Snowball
0 fearly fearli fearly fear
1 running run runn run
2 ran ran ran ran
3 easily easili easily easili
4 fairness fair fairnes fair
5 eating eat eat eat
6 eats eat eat eat
7 eater eater eater eater
8 eat eat eat eat
9 history histori history histori
10 historical histor historical histor
11 congratulations congratul congratulation congratul
12 sliding slide slid slide
13 comfortable comfort comfort comfort

Stemming Analysis

As you can see from the results:

  1. Porter Stemmer: One of the oldest and simplest stemmers, it applies a set of rules to remove suffixes.
  2. RegExp Stemmer: Uses regular expressions to strip specified patterns from the end of words. It’s simple but less comprehensive.
  3. Snowball Stemmer: An improved version of the Porter algorithm, also known as Porter2, offering better accuracy for English and support for multiple languages.

Notice how stemming can sometimes produce non-dictionary words (e.g., “histori” for “history”). This is one of the main drawbacks of stemming compared to lemmatization.
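
A quick way to see this over-stemming effect is to feed related but distinct words to the Porter stemmer and watch them collapse onto the same artificial stem:

# Related but distinct words can collapse to one non-dictionary stem (over-stemming)
for word in ["universe", "university", "universal"]:
    print(f"{word} -> {porter_stemmer.stem(word)}")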

Lemmatization

Lemmatization is similar to stemming, but it reduces words to their dictionary form (lemma) rather than just chopping off affixes. It considers the morphological analysis of the words and produces actual dictionary words.

Lemmatization is often preferred for applications like chatbots, Q&A systems, and text summarization because it preserves the meaning of words.

from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatization with different POS (Part-of-Speech) tags
# POS tags: n-noun, v-verb, a-adjective, r-adverb
pos_tags = {'n': 'noun', 'v': 'verb', 'a': 'adjective', 'r': 'adverb'}

# Test words for lemmatization
lemma_words = ["running", "ran", "better", "studies", "studied", "feet", "children", "geese", "mice", "are", "is", "was", "fairly"]

# Create comparison table for different POS tags
lemma_results = {'Original': lemma_words}

for pos_tag, name in pos_tags.items():
    lemma_results[f'Lemma ({name})'] = [lemmatizer.lemmatize(word, pos=pos_tag) for word in lemma_words]

# Display results
pd.DataFrame(lemma_results)
Original Lemma (noun) Lemma (verb) Lemma (adjective) Lemma (adverb)
0 running running run running running
1 ran ran run ran ran
2 better better better good well
3 studies study study studies studies
4 studied studied study studied studied
5 feet foot feet feet feet
6 children child children children children
7 geese goose geese geese geese
8 mice mouse mice mice mice
9 are are be are are
10 is is be is is
11 was wa be was was
12 fairly fairly fairly fairly fairly
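
Note that when no pos argument is supplied, WordNetLemmatizer treats the word as a noun by default, which is why verb inflections often come back unchanged:

# Without a pos argument the lemmatizer assumes a noun ('n')
print(lemmatizer.lemmatize("running"))           # default pos='n'
print(lemmatizer.lemmatize("running", pos='v'))  # treat as a verb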

Stemming vs. Lemmatization

Let’s compare stemming and lemmatization side by side to see the differences:

# Compare stemming vs lemmatization
compare_words = ["running", "better", "studies", "feet", "wolves", "are", "historically"]

comparison = {
    'Original': compare_words,
    'Porter Stemmer': [porter_stemmer.stem(word) for word in compare_words],
    'Snowball Stemmer': [snowball_stemmer.stem(word) for word in compare_words],
    'Lemmatization (verb)': [lemmatizer.lemmatize(word, pos='v') for word in compare_words],
    'Lemmatization (noun)': [lemmatizer.lemmatize(word, pos='n') for word in compare_words]
}

pd.DataFrame(comparison)
Original Porter Stemmer Snowball Stemmer Lemmatization (verb) Lemmatization (noun)
0 running run run run running
1 better better better better better
2 studies studi studi study study
3 feet feet feet feet foot
4 wolves wolv wolv wolves wolf
5 are are are be are
6 historically histor histor historically historically

Stopword Removal

Stopwords are common words like “the”, “a”, “an”, “in” that usually don’t carry much meaning in text analysis. Removing them can help reduce noise in text processing.

# Sample paragraph for stopword removal
paragraph = """On July 16, 1969, the Apollo 11 spacecraft launched from the Kennedy Space Center in Florida. Its mission was to go where no human being had gone before—the moon! The crew consisted of Neil Armstrong, Michael Collins, and Buzz Aldrin. The spacecraft landed on the moon in the Sea of Tranquility, a basaltic flood plain, on July 20, 1969. The moonwalk took place the following day. On July 21, 1969, at precisely 10:56 EDT, Commander Neil Armstrong emerged from the Lunar Module and took his famous first step onto the moon's surface. He declared, 
 It was a monumental moment in human history!"""
from nltk.corpus import stopwords

# Get English stopwords
stop_words = stopwords.words('english')

# Display first 20 stopwords
print(f"Total English stopwords: {len(stop_words)}")
print(f"Sample stopwords: {stop_words[:20]}")
Total English stopwords: 198
Sample stopwords: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']
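
The stopword corpus covers many languages, and the English list can be extended with your own domain-specific words; the extra words below are only examples:

# Stopword lists are available for several languages
print(stopwords.fileids()[:10])

# Extend the English list with domain-specific words (example additions)
custom_stop_words = set(stop_words) | {"spacecraft", "mission"}
print(f"{len(stop_words)} -> {len(custom_stop_words)} stopwords")
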
# Process text with and without stopwords removal
# Tokenize the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)

# Initialize lists for processed sentences
with_stopwords = []
without_stopwords = []

# Process each sentence
for sentence in sentences[:3]:  # Process first 3 sentences for brevity
    words = nltk.word_tokenize(sentence)
    
    # Keep all words
    with_stopwords.append(' '.join(words))
    
    # Remove stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    without_stopwords.append(' '.join(filtered_words))

# Create DataFrame for comparison
stopword_df = pd.DataFrame({
    'Original with Stopwords': with_stopwords,
    'After Stopwords Removal': without_stopwords
})

stopword_df
Original with Stopwords After Stopwords Removal
0 On July 16 , 1969 , the Apollo 11 spacecraft l... July 16 , 1969 , Apollo 11 spacecraft launched...
1 Its mission was to go where no human being had... mission go human gone before—the moon !
2 The crew consisted of Neil Armstrong , Michael... crew consisted Neil Armstrong , Michael Collin...

Complete Text Processing Pipeline

Let’s put everything together to create a complete text processing pipeline that includes tokenization, stopword removal, and either stemming or lemmatization.

def process_text(text, use_stemming=True, use_lemmatization=False):
    """Process text using a complete NLP pipeline.
    
    Args:
        text (str): Input text to process
        use_stemming (bool): Whether to apply stemming (takes precedence if both flags are True)
        use_lemmatization (bool): Whether to apply lemmatization
        
    Returns:
        list: List of processed sentences
    """
    # Build the stopword set once instead of rebuilding it for every word
    stop_words = set(stopwords.words('english'))
    
    # Tokenize into sentences
    sentences = nltk.sent_tokenize(text)
    processed_sentences = []
    
    for sentence in sentences:
        # Tokenize into words
        words = nltk.word_tokenize(sentence)
        
        # Lowercase, then drop stopwords and punctuation
        filtered_words = [word.lower() for word in words
                          if word.lower() not in stop_words and word.isalnum()]
        
        # Apply stemming or lemmatization
        if use_stemming:
            processed_words = [snowball_stemmer.stem(word) for word in filtered_words]
        elif use_lemmatization:
            processed_words = [lemmatizer.lemmatize(word, pos='v') for word in filtered_words]
        else:
            processed_words = filtered_words
            
        processed_sentences.append(' '.join(processed_words))
        
    return processed_sentences

# Process the paragraph
stemmed_text = process_text(paragraph, use_stemming=True, use_lemmatization=False)
lemmatized_text = process_text(paragraph, use_stemming=False, use_lemmatization=True)

# Display first 3 processed sentences
for i, (stem, lemma) in enumerate(zip(stemmed_text[:3], lemmatized_text[:3])):
    print(f"Sentence {i+1}:")
    print(f"  Stemmed: {stem}")
    print(f"  Lemmatized: {lemma}")
    print()
Sentence 1:
  Stemmed: juli 16 1969 apollo 11 spacecraft launch kennedi space center florida
  Lemmatized: july 16 1969 apollo 11 spacecraft launch kennedy space center florida

Sentence 2:
  Stemmed: mission go human gone moon
  Lemmatized: mission go human go moon

Sentence 3:
  Stemmed: crew consist neil armstrong michael collin buzz aldrin
  Lemmatized: crew consist neil armstrong michael collins buzz aldrin

Part-of-Speech (POS) Tagging

POS tagging is the process of marking words in a text with their corresponding part of speech (noun, verb, adjective, etc.). It’s an essential step for many NLP applications.

# Common POS tags in NLTK
pos_tags_info = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal digit',
    'DT': 'Determiner',
    'EX': 'Existential there ("there is")',
    'FW': 'Foreign word',
    'IN': 'Preposition/subordinating conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, comparative ("bigger")',
    'JJS': 'Adjective, superlative ("biggest")',
    'LS': 'List marker',
    'MD': 'Modal (could, will)',
    'NN': 'Noun, singular',
    'NNS': 'Noun plural',
    'NNP': 'Proper noun, singular',
    'NNPS': 'Proper noun, plural',
    'PDT': 'Predeterminer',
    'POS': 'Possessive ending',
    'PRP': 'Personal pronoun (I, he, she)',
    'PRP$': 'Possessive pronoun (my, his, hers)',
    'RB': 'Adverb',
    'RBR': 'Adverb, comparative',
    'RBS': 'Adverb, superlative',
    'RP': 'Particle',
    'TO': 'to',
    'UH': 'Interjection',
    'VB': 'Verb, base form',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund/present participle',
    'VBN': 'Verb, past participle',
    'VBP': 'Verb, sing. present, non-3d',
    'VBZ': 'Verb, 3rd person sing. present',
    'WDT': 'Wh-determiner (which)',
    'WP': 'Wh-pronoun (who, what)',
    'WP$': 'Possessive wh-pronoun (whose)',
    'WRB': 'Wh-adverb (where, when)'
}

# Display POS tag information as a table
pos_df = pd.DataFrame([(tag, desc) for tag, desc in pos_tags_info.items()], 
                      columns=['Tag', 'Description'])
pos_df
Tag Description
0 CC Coordinating conjunction
1 CD Cardinal digit
2 DT Determiner
3 EX Existential there ("there is")
4 FW Foreign word
5 IN Preposition/subordinating conjunction
6 JJ Adjective
7 JJR Adjective, comparative ("bigger")
8 JJS Adjective, superlative ("biggest")
9 LS List marker
10 MD Modal (could, will)
11 NN Noun, singular
12 NNS Noun plural
13 NNP Proper noun, singular
14 NNPS Proper noun, plural
15 PDT Predeterminer
16 POS Possessive ending
17 PRP Personal pronoun (I, he, she)
18 PRP$ Possessive pronoun (my, his, hers)
19 RB Adverb
20 RBR Adverb, comparative
21 RBS Adverb, superlative
22 RP Particle
23 TO to
24 UH Interjection
25 VB Verb, base form
26 VBD Verb, past tense
27 VBG Verb, gerund/present participle
28 VBN Verb, past participle
29 VBP Verb, sing. present, non-3d
30 VBZ Verb, 3rd person sing. present
31 WDT Wh-determiner (which)
32 WP Wh-pronoun (who, what)
33 WP$ Possessive wh-pronoun (whose)
34 WRB Wh-adverb (where, when)
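
NLTK can also print these definitions itself. As a small aside (it assumes the 'tagsets' data package has been downloaded), nltk.help.upenn_tagset shows the official description and examples for a tag:

# Requires the 'tagsets' resource
nltk.download('tagsets')
nltk.help.upenn_tagset('VBZ')
nltk.help.upenn_tagset('JJ')
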
# Example sentences for POS tagging
example_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I am studying natural language processing.",
    "She walked to the store, but it was closed."
]

# Perform POS tagging
for i, sentence in enumerate(example_sentences, 1):
    # Tokenize and tag words
    words = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(words)
    
    # Create a visualization of the tagged sentence
    print(f"Sentence {i}: {sentence}")
    
    # Display tagged words in a table format
    tagged_df = pd.DataFrame(tagged, columns=['Word', 'POS Tag'])
    tagged_df['Description'] = tagged_df['POS Tag'].map(lambda tag: pos_tags_info.get(tag, 'Unknown'))
    display(tagged_df)
    print("\n")
Sentence 1: The quick brown fox jumps over the lazy dog.
Word POS Tag Description
0 The DT Determiner
1 quick JJ Adjective
2 brown NN Noun, singular
3 fox NN Noun, singular
4 jumps VBZ Verb, 3rd person sing. present
5 over IN Preposition/subordinating conjunction
6 the DT Determiner
7 lazy JJ Adjective
8 dog NN Noun, singular
9 . . Unknown


Sentence 2: I am studying natural language processing.
Word POS Tag Description
0 I PRP Personal pronoun (I, he, she)
1 am VBP Verb, sing. present, non-3d
2 studying VBG Verb, gerund/present participle
3 natural JJ Adjective
4 language NN Noun, singular
5 processing NN Noun, singular
6 . . Unknown


Sentence 3: She walked to the store, but it was closed.
Word POS Tag Description
0 She PRP Personal pronoun (I, he, she)
1 walked VBD Verb, past tense
2 to TO to
3 the DT Determiner
4 store NN Noun, singular
5 , , Unknown
6 but CC Coordinating conjunction
7 it PRP Personal pronoun (I, he, she)
8 was VBD Verb, past tense
9 closed VBN Verb, past participle
10 . . Unknown
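
POS tags pair naturally with lemmatization: instead of guessing the part of speech, each Penn Treebank tag can be mapped to the corresponding WordNet POS before lemmatizing. A minimal sketch (the helper name penn_to_wordnet is my own):

from nltk.corpus import wordnet

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tagged = nltk.pos_tag(nltk.word_tokenize("She walked to the store, but it was closed."))
print([lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged])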

Named Entity Recognition (NER)

Named Entity Recognition is the process of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, time expressions, quantities, etc.

# Example sentences for NER
ner_examples = [
    "The Eiffel Tower stands on four lattice-girder piers that taper inward and join to form a single large vertical tower.",
    "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.",
    "Barack Obama was born in Hawaii and served as the 44th president of the United States from 2009 to 2017."
]
# Process examples with NER
for i, example in enumerate(ner_examples, 1):
    # Tokenize and tag
    words = nltk.word_tokenize(example)
    pos_tags = nltk.pos_tag(words)
    
    # Apply NER
    ner_tree = nltk.ne_chunk(pos_tags)
    
    print(f"Example {i}: {example}")
    print("\nNamed Entities:")
    
    # Extract and print named entities
    named_entities = []
    for chunk in ner_tree:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join(c[0] for c in chunk)
            entity_type = chunk.label()
            named_entities.append((entity_name, entity_type))
    
    if named_entities:
        entities_df = pd.DataFrame(named_entities, columns=['Entity', 'Type'])
        display(entities_df)
    else:
        print("No named entities found")
    
    print("\n")
Example 1: The Eiffel Tower stands on four lattice-girder piers that taper inward and join to form a single large vertical tower.

Named Entities:
Entity Type
0 Eiffel Tower ORGANIZATION


Example 2: Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.

Named Entities:
Entity Type
0 Apple PERSON
1 Inc. ORGANIZATION
2 Steve Jobs PERSON
3 Steve Wozniak PERSON
4 Ronald Wayne PERSON


Example 3: Barack Obama was born in Hawaii and served as the 44th president of the United States from 2009 to 2017.

Named Entities:
Entity Type
0 Barack PERSON
1 Obama PERSON
2 Hawaii GPE
3 United States GPE
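
The chunk tree can also be flattened into word-level IOB (inside/outside/begin) labels, which is often easier to post-process. A short sketch using nltk.chunk.tree2conlltags on the tree from the last example above:

from nltk.chunk import tree2conlltags

# Convert the chunk tree of the last example into (word, POS, IOB-entity) triples
iob_tags = tree2conlltags(ner_tree)
print(iob_tags[:8])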

# Visualize the NER tree (if svgling is installed)
# Use the second example for visualization
example = ner_examples[1]
words = nltk.word_tokenize(example)
pos_tags = nltk.pos_tag(words)
ner_tree = nltk.ne_chunk(pos_tags)

try:
    import svgling
    
    print(f"Named Entity Tree for: {example}")
    svgling.draw_tree(ner_tree)
except ImportError:
    print("To visualize NER trees, install the 'svgling' package using: pip install svgling")
    # Fallback: print the tree as plain text
    print(ner_tree)
Named Entity Tree for: Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.

Text Analysis Dashboard

Let’s create a comprehensive analysis of text using various NLP techniques we’ve learned.

def analyze_text(text):
    """Comprehensive text analysis using NLTK"""
    from collections import Counter
    import re
    
    # Basic statistics
    sentences = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    words_lower = [word.lower() for word in words if word.isalnum()]
    stop_words = set(stopwords.words('english'))
    words_no_stop = [word for word in words_lower if word not in stop_words]
    
    # Word frequency
    word_freq = Counter(words_no_stop)
    common_words = word_freq.most_common(10)
    
    # POS distribution
    pos_tags = nltk.pos_tag(words_lower)
    pos_counts = Counter([tag for _, tag in pos_tags])
    
    # Named entities
    ner_tree = nltk.ne_chunk(nltk.pos_tag(words))
    named_entities = []
    for chunk in ner_tree:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join(c[0] for c in chunk)
            entity_type = chunk.label()
            named_entities.append((entity_name, entity_type))
    
    # Print results
    print("=== TEXT ANALYSIS DASHBOARD ===")
    print(f"Text length: {len(text)} characters")
    print(f"Sentences: {len(sentences)}")
    print(f"Words: {len(words_lower)}")
    print(f"Unique words: {len(set(words_lower))}")
    print(f"Words without stopwords: {len(words_no_stop)}")
    
    print("\n=== MOST COMMON WORDS ===")
    for word, count in common_words:
        print(f"{word}: {count}")
    
    print("\n=== PART OF SPEECH DISTRIBUTION ===")
    for pos, count in pos_counts.most_common(5):
        print(f"{pos} ({pos_tags_info.get(pos, 'Unknown')}): {count}")
    
    print("\n=== NAMED ENTITIES ===")
    if named_entities:
        entities_df = pd.DataFrame(named_entities, columns=['Entity', 'Type'])
        display(entities_df)
    else:
        print("No named entities found")
    
    # Generate word cloud if matplotlib is available
    try:
        from wordcloud import WordCloud
        
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(words_no_stop))
        
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title('Word Cloud')
        plt.show()
    except ImportError:
        print("\nInstall wordcloud package for word cloud visualization: pip install wordcloud")

# Run analysis on the Apollo 11 paragraph
analyze_text(paragraph)
=== TEXT ANALYSIS DASHBOARD ===
Text length: 593 characters
Sentences: 7
Words: 100
Unique words: 70
Words without stopwords: 63

=== MOST COMMON WORDS ===
july: 3
1969: 3
moon: 3
spacecraft: 2
human: 2
neil: 2
armstrong: 2
took: 2
16: 1
apollo: 1

=== PART OF SPEECH DISTRIBUTION ===
NN (Noun, singular): 32
IN (Preposition/subordinating conjunction): 14
DT (Determiner): 13
VBD (Verb, past tense): 9
JJ (Adjective): 8

=== NAMED ENTITIES ===
Entity Type
0 Kennedy Space Center FACILITY
1 Florida GPE
2 Neil Armstrong PERSON
3 Michael Collins PERSON
4 Buzz Aldrin PERSON
5 Sea ORGANIZATION
6 Tranquility GPE
7 Commander Neil Armstrong ORGANIZATION
8 Lunar Module ORGANIZATION

Install wordcloud package for word cloud visualization: pip install wordcloud
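
Finally, the POS distribution from the dashboard lends itself to a quick chart, which also puts the matplotlib and seaborn imports from the setup cell to use. A minimal sketch (the counts are recomputed here rather than returned from analyze_text):

from collections import Counter

# Recompute the POS distribution for the paragraph and plot the most common tags
tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(paragraph))]
tag_counts = Counter(tags).most_common(10)

pos_plot_df = pd.DataFrame(tag_counts, columns=['POS Tag', 'Count'])
plt.figure(figsize=(8, 4))
sns.barplot(data=pos_plot_df, x='POS Tag', y='Count')
plt.title('POS Tag Distribution (Top 10)')
plt.tight_layout()
plt.show()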