Natural Language Processing with NLTK

Setup

First, let’s install and import NLTK and download the necessary resources.

! pip install nltk
Requirement already satisfied: nltk in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (3.9.1)
Requirement already satisfied: click in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (8.1.8)
Requirement already satisfied: joblib in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (1.4.2)
Requirement already satisfied: regex>=2021.8.3 in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (2024.11.6)
Requirement already satisfied: tqdm in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from nltk) (4.67.1)
Requirement already satisfied: colorama in c:\users\fevzikilas\desktop\nlp\nlp-l\lib\site-packages (from click->nltk) (0.4.6)

import nltk
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\fevzikilas\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
True
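
Note: recent NLTK releases (3.8.2 and later) load some components from table-based resources with slightly different names. If tokenization, tagging, or chunking later raises a "Resource ... not found" error, the hedged sketch below downloads those newer variants as well; otherwise it can be skipped.

# Optional: newer NLTK versions may ask for "_tab"/"_eng" variants of some resources.
# Only needed if you later hit "Resource ... not found" errors.
for resource in ['punkt_tab', 'averaged_perceptron_tagger_eng', 'maxent_ne_chunker_tab']:
    nltk.download(resource)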

Sample Text Data

Let’s create a sample corpus to work with throughout this notebook.

corpus = """This is a sample text corpus.
It contains multiple sentences.
The purpose of this corpus is to demonstrate text processing.
"""
# Display the corpus
print("Sample corpus:")
corpus
Sample corpus:
'This is a sample text corpus.\nIt contains multiple sentences.\nThe purpose of this corpus is to demonstrate text processing.\n'

Tokenization

Tokenization is the process of breaking down text into smaller units, such as sentences or words. This is typically the first step in any NLP pipeline.

Sentence Tokenization

Sentence tokenization splits a paragraph or document into individual sentences.

from nltk.tokenize import sent_tokenize

# Tokenize corpus into sentences
sentences = sent_tokenize(corpus)

print(f"Number of sentences: {len(sentences)}")
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")
Number of sentences: 3
Sentence 1: This is a sample text corpus.
Sentence 2: It contains multiple sentences.
Sentence 3: The purpose of this corpus is to demonstrate text processing.
# Check the type of the tokenized result
print(f"Type of tokenized sentences: {type(sentences)}")
Type of tokenized sentences: <class 'list'>
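
Under the hood, sent_tokenize uses the pre-trained Punkt model, which is built to avoid splitting on common abbreviations. A minimal sketch with a made-up example sentence:

# Punkt is trained to recognize abbreviations such as "Dr." and "p.m."
abbrev_text = "Dr. Smith arrived at 5 p.m. on Monday. He left the next morning."
for s in sent_tokenize(abbrev_text):
    print(s)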

Word Tokenization

Word tokenization splits sentences into individual words. NLTK offers several tokenizers with different behaviors.

# Word tokenization (sentence → words)
from nltk.tokenize import word_tokenize

# Tokenize each sentence into words
print("Word tokenization results:")
for i, sentence in enumerate(sentences, 1):
    words = word_tokenize(sentence)
    print(f"Sentence {i}: {words}")
Word tokenization results:
Sentence 1: ['This', 'is', 'a', 'sample', 'text', 'corpus', '.']
Sentence 2: ['It', 'contains', 'multiple', 'sentences', '.']
Sentence 3: ['The', 'purpose', 'of', 'this', 'corpus', 'is', 'to', 'demonstrate', 'text', 'processing', '.']

Comparing Different Tokenizers

NLTK provides various tokenizers, each with different rules and behaviors. Let’s compare them:

from nltk.tokenize import wordpunct_tokenize, TreebankWordTokenizer
# Sample text for comparison
sample = "Don't hesitate to email me at john.doe@example.com or call at 555-123-4567!"

# Compare different tokenizers
tokenizers = {
    'word_tokenize': word_tokenize,
    'wordpunct_tokenize': wordpunct_tokenize,
    'TreebankWordTokenizer': TreebankWordTokenizer().tokenize
}

# Create a DataFrame to display results
results = {}
max_length = 0

# Tokenize and find the maximum length of tokenized results
for name, tokenizer in tokenizers.items():
    tokenized = tokenizer(sample)
    results[name] = tokenized
    max_length = max(max_length, len(tokenized))

# Pad tokenized results to make all arrays the same length
for name in results:
    results[name] += [None] * (max_length - len(results[name]))

# Display results as a DataFrame
pd.DataFrame(results).T
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
word_tokenize Do n't hesitate to email me at john.doe @ example.com ... ! None None None None None None None None None
wordpunct_tokenize Don ' t hesitate to email me at john . ... com or call at 555 - 123 - 4567 !
TreebankWordTokenizer Do n't hesitate to email me at john.doe @ example.com ... ! None None None None None None None None None

3 rows × 24 columns
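
Beyond these three, NLTK ships further tokenizers for special cases. As a rough sketch (the regular expression below is purely illustrative), RegexpTokenizer lets you define your own token pattern and TweetTokenizer is tuned for informal text:

from nltk.tokenize import RegexpTokenizer, TweetTokenizer

# Treat every run of non-whitespace characters as a single token,
# so the email address stays in one piece
regexp_tok = RegexpTokenizer(r'\S+')
print(regexp_tok.tokenize(sample))

# TweetTokenizer is designed for informal text (contractions, emoticons, handles, ...)
tweet_tok = TweetTokenizer()
print(tweet_tok.tokenize(sample))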

Stemming

Stemming is the process of reducing words to their word stem or root form. It’s a rule-based process that chops off the ends of words to remove affixes. Stemming is useful for text normalization, but often produces non-dictionary words.

Comparing Different Stemmers

from nltk.stem import PorterStemmer, RegexpStemmer, SnowballStemmer

# Create sample words to compare stemmers
words = ["fearly", "running", "ran", "easily", "fairness", "eating", "eats", "eater", "eat", "history", "historical", "congratulations", "sliding", "comfortable"]
# Initialize stemmers
porter_stemmer = PorterStemmer()
regexp_stemmer = RegexpStemmer('ing$|s$|e$|able', min=4)
snowball_stemmer = SnowballStemmer(language="english")

# Create comparison table
stemming_results = {
    'Original': words,
    'Porter': [porter_stemmer.stem(word) for word in words],
    'RegExp': [regexp_stemmer.stem(word) for word in words],
    'Snowball': [snowball_stemmer.stem(word) for word in words]
}

# Display results
stemming_df = pd.DataFrame(stemming_results)
stemming_df
Original Porter RegExp Snowball
0 fearly fearli fearly fear
1 running run runn run
2 ran ran ran ran
3 easily easili easily easili
4 fairness fair fairnes fair
5 eating eat eat eat
6 eats eat eat eat
7 eater eater eater eater
8 eat eat eat eat
9 history histori history histori
10 historical histor historical histor
11 congratulations congratul congratulation congratul
12 sliding slide slid slide
13 comfortable comfort comfort comfort

Stemming Analysis

As you can see from the results:

  1. Porter Stemmer: One of the oldest and simplest stemmers, it applies a set of rules to remove suffixes.
  2. RegExp Stemmer: Uses regular expressions to strip specified patterns from the end of words. It’s simple but less comprehensive.
  3. Snowball Stemmer: An improved version of the Porter algorithm, also known as Porter2, offering better accuracy for English and support for multiple languages.

Notice how stemming can sometimes produce non-dictionary words (e.g., “histori” for “history”). This is one of the main drawbacks of stemming compared to lemmatization.
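
A quick way to see this over-stemming effect is to feed related but distinct words to the Porter stemmer and watch them collapse onto the same artificial stem:

# Related but distinct words can collapse to one non-dictionary stem (over-stemming)
for word in ["universe", "university", "universal"]:
    print(f"{word} -> {porter_stemmer.stem(word)}")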

Lemmatization

Lemmatization is similar to stemming, but it reduces words to their dictionary form (lemma) rather than just chopping off affixes. It considers the morphological analysis of the words and produces actual dictionary words.

Lemmatization is often preferred for applications like chatbots, Q&A systems, and text summarization because it preserves the meaning of words.

from nltk.stem import WordNetLemmatizer

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatization with different POS (Part-of-Speech) tags
# POS tags: n-noun, v-verb, a-adjective, r-adverb
pos_tags = {'n': 'noun', 'v': 'verb', 'a': 'adjective', 'r': 'adverb'}

# Test words for lemmatization
lemma_words = ["running", "ran", "better", "studies", "studied", "feet", "children", "geese", "mice", "are", "is", "was", "fairly"]

# Create comparison table for different POS tags
lemma_results = {'Original': lemma_words}

for pos_tag, name in pos_tags.items():
    lemma_results[f'Lemma ({name})'] = [lemmatizer.lemmatize(word, pos=pos_tag) for word in lemma_words]

# Display results
pd.DataFrame(lemma_results)
Original Lemma (noun) Lemma (verb) Lemma (adjective) Lemma (adverb)
0 running running run running running
1 ran ran run ran ran
2 better better better good well
3 studies study study studies studies
4 studied studied study studied studied
5 feet foot feet feet feet
6 children child children children children
7 geese goose geese geese geese
8 mice mouse mice mice mice
9 are are be are are
10 is is be is is
11 was wa be was was
12 fairly fairly fairly fairly fairly
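
Note that when no pos argument is supplied, WordNetLemmatizer treats the word as a noun by default, which is why verb inflections often come back unchanged:

# Without a pos argument the lemmatizer assumes a noun ('n')
print(lemmatizer.lemmatize("running"))           # default pos='n'
print(lemmatizer.lemmatize("running", pos='v'))  # treat as a verb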

Stemming vs. Lemmatization

Let’s compare stemming and lemmatization side by side to see the differences:

# Compare stemming vs lemmatization
compare_words = ["running", "better", "studies", "feet", "wolves", "are", "historically"]

comparison = {
    'Original': compare_words,
    'Porter Stemmer': [porter_stemmer.stem(word) for word in compare_words],
    'Snowball Stemmer': [snowball_stemmer.stem(word) for word in compare_words],
    'Lemmatization (verb)': [lemmatizer.lemmatize(word, pos='v') for word in compare_words],
    'Lemmatization (noun)': [lemmatizer.lemmatize(word, pos='n') for word in compare_words]
}

pd.DataFrame(comparison)
Original Porter Stemmer Snowball Stemmer Lemmatization (verb) Lemmatization (noun)
0 running run run run running
1 better better better better better
2 studies studi studi study study
3 feet feet feet feet foot
4 wolves wolv wolv wolves wolf
5 are are are be are
6 historically histor histor historically historically

Stopword Removal

Stopwords are common words like “the”, “a”, “an”, “in” that usually don’t carry much meaning in text analysis. Removing them can help reduce noise in text processing.

# Sample paragraph for stopword removal
paragraph = """On July 16, 1969, the Apollo 11 spacecraft launched from the Kennedy Space Center in Florida. Its mission was to go where no human being had gone before—the moon! The crew consisted of Neil Armstrong, Michael Collins, and Buzz Aldrin. The spacecraft landed on the moon in the Sea of Tranquility, a basaltic flood plain, on July 20, 1969. The moonwalk took place the following day. On July 21, 1969, at precisely 10:56 EDT, Commander Neil Armstrong emerged from the Lunar Module and took his famous first step onto the moon's surface. He declared, 
 It was a monumental moment in human history!"""
from nltk.corpus import stopwords

# Get English stopwords
stop_words = stopwords.words('english')

# Display first 20 stopwords
print(f"Total English stopwords: {len(stop_words)}")
print(f"Sample stopwords: {stop_words[:20]}")
Total English stopwords: 198
Sample stopwords: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']
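
The stopword corpus covers many languages, and the English list can be extended with your own domain-specific words; the extra words below are only examples:

# Stopword lists are available for several languages
print(stopwords.fileids()[:10])

# Extend the English list with domain-specific words (example additions)
custom_stop_words = set(stop_words) | {"spacecraft", "mission"}
print(f"{len(stop_words)} -> {len(custom_stop_words)} stopwords")
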
# Process text with and without stopwords removal
# Tokenize the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)

# Initialize lists for processed sentences
with_stopwords = []
without_stopwords = []

# Process each sentence
for sentence in sentences[:3]:  # Process first 3 sentences for brevity
    words = nltk.word_tokenize(sentence)
    
    # Keep all words
    with_stopwords.append(' '.join(words))
    
    # Remove stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    without_stopwords.append(' '.join(filtered_words))

# Create DataFrame for comparison
stopword_df = pd.DataFrame({
    'Original with Stopwords': with_stopwords,
    'After Stopwords Removal': without_stopwords
})

stopword_df
Original with Stopwords After Stopwords Removal
0 On July 16 , 1969 , the Apollo 11 spacecraft l... July 16 , 1969 , Apollo 11 spacecraft launched...
1 Its mission was to go where no human being had... mission go human gone before—the moon !
2 The crew consisted of Neil Armstrong , Michael... crew consisted Neil Armstrong , Michael Collin...

Complete Text Processing Pipeline

Let’s put everything together to create a complete text processing pipeline that includes tokenization, stopword removal, and either stemming or lemmatization.

def process_text(text, use_stemming=True, use_lemmatization=False):
    """Process text using a complete NLP pipeline.
    
    Args:
        text (str): Input text to process
        use_stemming (bool): Whether to apply stemming (takes precedence if both flags are True)
        use_lemmatization (bool): Whether to apply lemmatization
        
    Returns:
        list: List of processed sentences
    """
    # Build the stopword set once instead of rebuilding it for every word
    stop_words = set(stopwords.words('english'))
    
    # Tokenize into sentences
    sentences = nltk.sent_tokenize(text)
    processed_sentences = []
    
    for sentence in sentences:
        # Tokenize into words
        words = nltk.word_tokenize(sentence)
        
        # Lowercase, then drop stopwords and punctuation
        filtered_words = [word.lower() for word in words
                          if word.lower() not in stop_words and word.isalnum()]
        
        # Apply stemming or lemmatization
        if use_stemming:
            processed_words = [snowball_stemmer.stem(word) for word in filtered_words]
        elif use_lemmatization:
            processed_words = [lemmatizer.lemmatize(word, pos='v') for word in filtered_words]
        else:
            processed_words = filtered_words
            
        processed_sentences.append(' '.join(processed_words))
        
    return processed_sentences

# Process the paragraph
stemmed_text = process_text(paragraph, use_stemming=True, use_lemmatization=False)
lemmatized_text = process_text(paragraph, use_stemming=False, use_lemmatization=True)

# Display first 3 processed sentences
for i, (stem, lemma) in enumerate(zip(stemmed_text[:3], lemmatized_text[:3])):
    print(f"Sentence {i+1}:")
    print(f"  Stemmed: {stem}")
    print(f"  Lemmatized: {lemma}")
    print()
Sentence 1:
  Stemmed: juli 16 1969 apollo 11 spacecraft launch kennedi space center florida
  Lemmatized: july 16 1969 apollo 11 spacecraft launch kennedy space center florida

Sentence 2:
  Stemmed: mission go human gone moon
  Lemmatized: mission go human go moon

Sentence 3:
  Stemmed: crew consist neil armstrong michael collin buzz aldrin
  Lemmatized: crew consist neil armstrong michael collins buzz aldrin

Part-of-Speech (POS) Tagging

POS tagging is the process of marking words in a text with their corresponding part of speech (noun, verb, adjective, etc.). It’s an essential step for many NLP applications.

# Common POS tags in NLTK
pos_tags_info = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal digit',
    'DT': 'Determiner',
    'EX': 'Existential there ("there is")',
    'FW': 'Foreign word',
    'IN': 'Preposition/subordinating conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, comparative ("bigger")',
    'JJS': 'Adjective, superlative ("biggest")',
    'LS': 'List marker',
    'MD': 'Modal (could, will)',
    'NN': 'Noun, singular',
    'NNS': 'Noun plural',
    'NNP': 'Proper noun, singular',
    'NNPS': 'Proper noun, plural',
    'PDT': 'Predeterminer',
    'POS': 'Possessive ending',
    'PRP': 'Personal pronoun (I, he, she)',
    'PRP$': 'Possessive pronoun (my, his, hers)',
    'RB': 'Adverb',
    'RBR': 'Adverb, comparative',
    'RBS': 'Adverb, superlative',
    'RP': 'Particle',
    'TO': 'to',
    'UH': 'Interjection',
    'VB': 'Verb, base form',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund/present participle',
    'VBN': 'Verb, past participle',
    'VBP': 'Verb, sing. present, non-3d',
    'VBZ': 'Verb, 3rd person sing. present',
    'WDT': 'Wh-determiner (which)',
    'WP': 'Wh-pronoun (who, what)',
    'WP$': 'Possessive wh-pronoun (whose)',
    'WRB': 'Wh-adverb (where, when)'
}

# Display POS tag information as a table
pos_df = pd.DataFrame([(tag, desc) for tag, desc in pos_tags_info.items()], 
                      columns=['Tag', 'Description'])
pos_df
Tag Description
0 CC Coordinating conjunction
1 CD Cardinal digit
2 DT Determiner
3 EX Existential there ("there is")
4 FW Foreign word
5 IN Preposition/subordinating conjunction
6 JJ Adjective
7 JJR Adjective, comparative ("bigger")
8 JJS Adjective, superlative ("biggest")
9 LS List marker
10 MD Modal (could, will)
11 NN Noun, singular
12 NNS Noun plural
13 NNP Proper noun, singular
14 NNPS Proper noun, plural
15 PDT Predeterminer
16 POS Possessive ending
17 PRP Personal pronoun (I, he, she)
18 PRP$ Possessive pronoun (my, his, hers)
19 RB Adverb
20 RBR Adverb, comparative
21 RBS Adverb, superlative
22 RP Particle
23 TO to
24 UH Interjection
25 VB Verb, base form
26 VBD Verb, past tense
27 VBG Verb, gerund/present participle
28 VBN Verb, past participle
29 VBP Verb, sing. present, non-3d
30 VBZ Verb, 3rd person sing. present
31 WDT Wh-determiner (which)
32 WP Wh-pronoun (who, what)
33 WP$ Possessive wh-pronoun (whose)
34 WRB Wh-adverb (where, when)
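
NLTK can also print these definitions itself. As a small aside (it assumes the 'tagsets' data package has been downloaded), nltk.help.upenn_tagset shows the official description and examples for a tag:

# Requires the 'tagsets' resource
nltk.download('tagsets')
nltk.help.upenn_tagset('VBZ')
nltk.help.upenn_tagset('JJ')
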
# Example sentences for POS tagging
example_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "I am studying natural language processing.",
    "She walked to the store, but it was closed."
]

# Perform POS tagging
for i, sentence in enumerate(example_sentences, 1):
    # Tokenize and tag words
    words = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(words)
    
    # Create a visualization of the tagged sentence
    print(f"Sentence {i}: {sentence}")
    
    # Display tagged words in a table format
    tagged_df = pd.DataFrame(tagged, columns=['Word', 'POS Tag'])
    tagged_df['Description'] = tagged_df['POS Tag'].map(lambda tag: pos_tags_info.get(tag, 'Unknown'))
    display(tagged_df)
    print("\n")
Sentence 1: The quick brown fox jumps over the lazy dog.
Word POS Tag Description
0 The DT Determiner
1 quick JJ Adjective
2 brown NN Noun, singular
3 fox NN Noun, singular
4 jumps VBZ Verb, 3rd person sing. present
5 over IN Preposition/subordinating conjunction
6 the DT Determiner
7 lazy JJ Adjective
8 dog NN Noun, singular
9 . . Unknown


Sentence 2: I am studying natural language processing.
Word POS Tag Description
0 I PRP Personal pronoun (I, he, she)
1 am VBP Verb, sing. present, non-3d
2 studying VBG Verb, gerund/present participle
3 natural JJ Adjective
4 language NN Noun, singular
5 processing NN Noun, singular
6 . . Unknown


Sentence 3: She walked to the store, but it was closed.
Word POS Tag Description
0 She PRP Personal pronoun (I, he, she)
1 walked VBD Verb, past tense
2 to TO to
3 the DT Determiner
4 store NN Noun, singular
5 , , Unknown
6 but CC Coordinating conjunction
7 it PRP Personal pronoun (I, he, she)
8 was VBD Verb, past tense
9 closed VBN Verb, past participle
10 . . Unknown
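
POS tags pair naturally with lemmatization: instead of guessing the part of speech, each Penn Treebank tag can be mapped to the corresponding WordNet POS before lemmatizing. A minimal sketch (the helper name penn_to_wordnet is my own):

from nltk.corpus import wordnet

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

tagged = nltk.pos_tag(nltk.word_tokenize("She walked to the store, but it was closed."))
print([lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged])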

Named Entity Recognition (NER)

Named Entity Recognition is the process of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, time expressions, quantities, etc.

# Example sentences for NER
ner_examples = [
    "The Eiffel Tower stands on four lattice-girder piers that taper inward and join to form a single large vertical tower.",
    "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.",
    "Barack Obama was born in Hawaii and served as the 44th president of the United States from 2009 to 2017."
]
# Process examples with NER
for i, example in enumerate(ner_examples, 1):
    # Tokenize and tag
    words = nltk.word_tokenize(example)
    pos_tags = nltk.pos_tag(words)
    
    # Apply NER
    ner_tree = nltk.ne_chunk(pos_tags)
    
    print(f"Example {i}: {example}")
    print("\nNamed Entities:")
    
    # Extract and print named entities
    named_entities = []
    for chunk in ner_tree:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join(c[0] for c in chunk)
            entity_type = chunk.label()
            named_entities.append((entity_name, entity_type))
    
    if named_entities:
        entities_df = pd.DataFrame(named_entities, columns=['Entity', 'Type'])
        display(entities_df)
    else:
        print("No named entities found")
    
    print("\n")
Example 1: The Eiffel Tower stands on four lattice-girder piers that taper inward and join to form a single large vertical tower.

Named Entities:
Entity Type
0 Eiffel Tower ORGANIZATION


Example 2: Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.

Named Entities:
Entity Type
0 Apple PERSON
1 Inc. ORGANIZATION
2 Steve Jobs PERSON
3 Steve Wozniak PERSON
4 Ronald Wayne PERSON


Example 3: Barack Obama was born in Hawaii and served as the 44th president of the United States from 2009 to 2017.

Named Entities:
Entity Type
0 Barack PERSON
1 Obama PERSON
2 Hawaii GPE
3 United States GPE
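
The chunk tree can also be flattened into word-level IOB (inside/outside/begin) labels, which is often easier to post-process. A short sketch using nltk.chunk.tree2conlltags on the tree from the last example above:

from nltk.chunk import tree2conlltags

# Convert the chunk tree of the last example into (word, POS, IOB-entity) triples
iob_tags = tree2conlltags(ner_tree)
print(iob_tags[:8])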

# Visualize the NER tree (if svgling is installed)
# Use the second example for visualization
example = ner_examples[1]
words = nltk.word_tokenize(example)
pos_tags = nltk.pos_tag(words)
ner_tree = nltk.ne_chunk(pos_tags)

try:
    import svgling
    
    print(f"Named Entity Tree for: {example}")
    svgling.draw_tree(ner_tree)
except ImportError:
    print("To visualize NER trees, install the 'svgling' package using: pip install svgling")
    # Fallback: print the tree as plain text
    print(ner_tree)
Named Entity Tree for: Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.

Text Analysis Dashboard

Let’s create a comprehensive analysis of text using various NLP techniques we’ve learned.

def analyze_text(text):
    """Comprehensive text analysis using NLTK"""
    from collections import Counter
    import re
    
    # Basic statistics
    sentences = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    words_lower = [word.lower() for word in words if word.isalnum()]
    stop_words = set(stopwords.words('english'))
    words_no_stop = [word for word in words_lower if word not in stop_words]
    
    # Word frequency
    word_freq = Counter(words_no_stop)
    common_words = word_freq.most_common(10)
    
    # POS distribution
    pos_tags = nltk.pos_tag(words_lower)
    pos_counts = Counter([tag for _, tag in pos_tags])
    
    # Named entities
    ner_tree = nltk.ne_chunk(nltk.pos_tag(words))
    named_entities = []
    for chunk in ner_tree:
        if hasattr(chunk, 'label'):
            entity_name = ' '.join(c[0] for c in chunk)
            entity_type = chunk.label()
            named_entities.append((entity_name, entity_type))
    
    # Print results
    print("=== TEXT ANALYSIS DASHBOARD ===")
    print(f"Text length: {len(text)} characters")
    print(f"Sentences: {len(sentences)}")
    print(f"Words: {len(words_lower)}")
    print(f"Unique words: {len(set(words_lower))}")
    print(f"Words without stopwords: {len(words_no_stop)}")
    
    print("\n=== MOST COMMON WORDS ===")
    for word, count in common_words:
        print(f"{word}: {count}")
    
    print("\n=== PART OF SPEECH DISTRIBUTION ===")
    for pos, count in pos_counts.most_common(5):
        print(f"{pos} ({pos_tags_info.get(pos, 'Unknown')}): {count}")
    
    print("\n=== NAMED ENTITIES ===")
    if named_entities:
        entities_df = pd.DataFrame(named_entities, columns=['Entity', 'Type'])
        display(entities_df)
    else:
        print("No named entities found")
    
    # Generate word cloud if matplotlib is available
    try:
        from wordcloud import WordCloud
        
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(words_no_stop))
        
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title('Word Cloud')
        plt.show()
    except ImportError:
        print("\nInstall wordcloud package for word cloud visualization: pip install wordcloud")

# Run analysis on the Apollo 11 paragraph
analyze_text(paragraph)
=== TEXT ANALYSIS DASHBOARD ===
Text length: 593 characters
Sentences: 7
Words: 100
Unique words: 70
Words without stopwords: 63

=== MOST COMMON WORDS ===
july: 3
1969: 3
moon: 3
spacecraft: 2
human: 2
neil: 2
armstrong: 2
took: 2
16: 1
apollo: 1

=== PART OF SPEECH DISTRIBUTION ===
NN (Noun, singular): 32
IN (Preposition/subordinating conjunction): 14
DT (Determiner): 13
VBD (Verb, past tense): 9
JJ (Adjective): 8

=== NAMED ENTITIES ===
Entity Type
0 Kennedy Space Center FACILITY
1 Florida GPE
2 Neil Armstrong PERSON
3 Michael Collins PERSON
4 Buzz Aldrin PERSON
5 Sea ORGANIZATION
6 Tranquility GPE
7 Commander Neil Armstrong ORGANIZATION
8 Lunar Module ORGANIZATION

Install wordcloud package for word cloud visualization: pip install wordcloud
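
Finally, the POS distribution from the dashboard lends itself to a quick chart, which also puts the matplotlib and seaborn imports from the setup cell to use. A minimal sketch (the counts are recomputed here rather than returned from analyze_text):

from collections import Counter

# Recompute the POS distribution for the paragraph and plot the most common tags
tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(paragraph))]
tag_counts = Counter(tags).most_common(10)

pos_plot_df = pd.DataFrame(tag_counts, columns=['POS Tag', 'Count'])
plt.figure(figsize=(8, 4))
sns.barplot(data=pos_plot_df, x='POS Tag', y='Count')
plt.title('POS Tag Distribution (Top 10)')
plt.tight_layout()
plt.show()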