
You want to build a word dictionary, where the size of a word corresponds to how often it appears in given string.
To do this, you’ll need data. Write code that takes a long string and builds its word data in a dictionary, where the keys are words and the values are the number of times the words occurred.
For example, look at these sentences:
'This is test string so test it out'
'This is Just a string'
Read More About Software Engineer Interview Questions and Answers.
You can assume that input will only contain words and standard punctuation.
Step By Step Guide:
We’ll have to go through the entire input string, and we’re returning a dictionary with every unique word. In the worst case every word is different, so our runtime and space cost will both be at least O(n)O(n).
This challenge has several parts. Let’s break them down.
- Create array of words from given string
- Add each Word into dictionary
- Handling words that are both uppercase and lowercase in the input string
How would you start the first part?
You can implement split_words() function, which will let us iterate over the input string only once and generate the array of words.
Here’s a simple example.
It is working solution and you can cover different edge cases with punctuation, regex.
def split_words(input_string):
words = []
current_word_start_index = 0
current_word_length = 0
for i, char in enumerate(input_string):
if char.isalpha():
if current_word_length == 0:
current_word_start_index = i
current_word_length += 1
else:
word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
words.append(word)
current_word_length = 0
return words
Every time we append a character to a string, Python makes a whole new string. If our input is one long word, then creating all these copies is O(n^2)O(n2) time.
Instead, we keep track of the index where our word starts and its current length. Once we hit a space, we can use string slicing to extract the current word to append to the list. That keeps our split function at O(n)O(n) time.
Now we’ve solved the first part of the challenge, splitting the words. The next part is populating our dictionary with unique words. What do we do with each word?
Solution:
ords_to_counts = {}
def add_word_to_dictionary(word):
if word in words_to_counts:
words_to_counts[word] += 1
else:
words_to_counts[word] = 1
Alright, last part! How should we handle words that are uppercase and lowercase?
Consider these sentences:
'We came, we saw, we ate cake.'
'Friends, Romans, countrymen! Let us eat cake.'
'New tourists in New York often wait in long lines for cronuts.'
Take some time to think of possible approaches. What are some other sentences you might run into. What are all your options?
When are words that should be lowercase not?
Why not?
What are the ideal cases we’d want in our dictionary?
Here are a few options:
- Only make a word uppercase in our dictionary if it is always uppercase in the original string.
- Make a word uppercase in our dictionary if it is ever uppercase in the original string.
- Make a word uppercase in our dictionary if it is ever uppercase in the original string in a position that is not the first word of a sentence.
- Use an API or other tool that identifies proper nouns.
- Ignore case entirely and make every word lowercase.
What are the pros and cons for each one?
Pros and cons include:
- Only make a word uppercase in our dictionary if it is always uppercase in the original string: this will have reasonable accuracy in very long strings where words are more likely to be included multiple times, but words that only ever occur as the first word in a sentence will always be included as uppercase.
- Make a word uppercase in our dictionary if it is ever uppercase in the original string: this will ensure proper nouns are always uppercase, but any words that are ever at the start of sentences will always be uppercase too.
- Make a word uppercase in our dictionary if it is ever uppercase in the original string in a position that is not the first word of a sentence: this addresses the problem with option (2), but proper nouns that are only ever at the start of sentences will be made lowercase.
- Use an API or other tool that identifies proper nouns: this has a lot of potential to give us a high level of accuracy, but we’ll give up control over decisions, we’ll be relying on code we didn’t write, and our practical runtime may be significantly increased.
- Ignore case entirely and make every word lowercase: this will give us simplicity and consistency, but we’ll lose all accuracy for words that should be uppercase.
Any of these could be considered reasonable. Importantly, none of them are perfect. They all have tradeoffs, and it is very difficult to write a highly accurate algorithm. Consider “cliff” and “bill” in these sentences:
'Cliff finished his cake and paid the bill.'
'Bill finished his cake at the edge of the cliff.'
You can choose whichever of the options you’d like, or another option you thought of. For this breakdown, we’re going to choose option (1).
Now, how do we update our add_word_to_dictionary() function to avoid duplicate words?
Think about the different possibilities:
- The word is uppercase or lowercase.
- The word is already in the dictionary or not.
- A different case of the word is already in the dictionary or not.
Moving forward, we can either:
- Check for words that are in the dictionary in both cases when we’re done populating the dictionary. If we add “Vanilla” three times and “vanilla” eight times, we’ll combine them into one “vanilla” at the end with a value 1111.
- Avoid ever having a word in our dictionary that’s both uppercase and lowercase. As we add “Vanilla”s and “vanilla”s, we’d always only ever have one version in our dictionary.
We’ll choose the second approach since it will save us a walk through our dictionary. How should we start?
If the word we’re adding is already in the dictionary in its current case, let’s increment its count. What if it’s not in the dictionary?
There are three possibilities:
- A lowercase version is in the dictionary (in which case we know our input word is uppercase, because if it is lowercase and already in the dictionary it would have passed our first check and we’d have just incremented its count)
- An uppercase version is in the dictionary (so we know our input word is lowercase)
- The word is not in the dictionary in any case
Let’s start with the first possibility. What do we want to do?
Because we only include a word as uppercase if it is always uppercase, we simply increment the lowercase version’s count.
# Current dictionary
# {'blue': 3}
# Adding
# 'Blue'
# Code
words_to_counts[word.lower()] += 1
# Updated dictionary
# {'blue': 4}
What about the second possibility?
This is a little more complicated. We need to remove the uppercase version from our dictionary if we encounter a lowercase version. But we still need the uppercase version’s count!
# Current dictionary
# {'Yellow': 6}
# Adding
# 'yellow'
# Code (we will write our "capitalize()" method later)
words_to_counts[word] = 1
words_to_counts[word] += words_to_counts[word.capitalize()]
del words_to_counts[word.capitalize()]
# Updated dictionary
# {'yellow': 7}
Finally, what if the word is not in the dictionary at all?
Easy—we add it and give it a count of 1.
# Current dictionary
# {'purple': 2}
# Adding
# 'indigo'
# Code
words_to_counts[word] = 1
# Updated dictionary
# {'purple': 2, 'indigo': 1}
Now we have all our pieces! We can split words, add them to a dictionary, and track the number of times each word occurs without having duplicate words of the same case. Can we improve our solution?
Let’s look at our runtime and space cost. We iterate through every character in the input string once and then every word in our list once. That’s a runtime of O(n)O(n), which is the best we can achieve for this challenge (we have to look at the entire input string). The space we’re using includes a list for each word and a dictionary for every unique word. Our worst case is that every word is different, so our space cost is also O(n)O(n), which is also the best we can achieve for this challenge (we have to return a dictionary of words).
But we can still make some optimizations!
How can we make our space cost even smaller?
We’re storing all our split words in a separate list. That at least doubles the memory we use! How can we eliminate the need for that list?
Right now, we store each word in our list as we split them. Instead, let’s just immediately populate each word in our dictionary!
Solution
In our solution, we make three decisions:
- We use a class. This allows us to tie our methods together, calling them on instances of our class instead of passing references.
- To handle duplicate words with different cases, we choose to make a word uppercase in our dictionary only if it is always uppercase in the original string. While this is a reasonable approach, it is imperfect (consider proper nouns that are also lowercase words, like “Bill” and “bill”).
- We build our own split_words() method instead of using a built-in one. This allows us to pass each word to our add_word_to_dictionary() method as it was split, and to split words and eliminate punctuation in one iteration.
To split the words in the input string and populate a dictionary of the unique words to the number of times they occurred, we:
- Split words by spaces, em dashes, and ellipses—making sure to include hyphens surrounded by characters. We also include all apostrophes (which will handle contractions nicely but will break possessives into separate words).
- Populate the words in our dictionary as they are identified, checking if the word is already in our dictionary in its current case or another case.
If the input word is uppercase and there’s a lowercase version in the dictionary, we increment the lowercase version’s count. If the input word is lowercase and there’s an uppercase version in the dictionary, we “demote” the uppercase version by adding the lowercase version and giving it the uppercase version’s count.
Python Solution:
class WordCloudData:
def __init__(self, input_string):
self.words_to_counts = {}
self.populate_words_to_counts(input_string)
def populate_words_to_counts(self, input_string):
# Iterates over each character in the input string, splitting
# words and passing them to add_word_to_dictionary()
current_word_start_index = 0
current_word_length = 0
for i, character in enumerate(input_string):
# If we reached the end of the string we check if the last
# character is a letter and add the last word to our dictionary
if i == len(input_string) - 1:
if character.isalpha():
current_word_length += 1
if current_word_length > 0:
current_word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
self.add_word_to_dictionary(current_word)
# If we reach a space or emdash we know we're at the end of a word
# so we add it to our dictionary and reset our current word
elif character == ' ' or character == u'\u2014':
if current_word_length > 0:
current_word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
self.add_word_to_dictionary(current_word)
current_word_length = 0
# We want to make sure we split on ellipses so if we get two periods in
# a row we add the current word to our dictionary and reset our current word
elif character == '.':
if i < len(input_string) - 1 and input_string[i + 1] == '.':
if current_word_length > 0:
current_word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
self.add_word_to_dictionary(current_word)
current_word_length = 0
# If the character is a letter or an apostrophe, we add it to our current word
elif character.isalpha() or character == '\'':
if current_word_length == 0:
current_word_start_index = i
current_word_length += 1
# If the character is a hyphen, we want to check if it's surrounded by letters
# If it is, we add it to our current word
elif character == '-':
if i > 0 and input_string[i - 1].isalpha() and \
input_string[i + 1].isalpha():
current_word_length += 1
else:
if current_word_length > 0:
current_word = input_string[current_word_start_index:
current_word_start_index + current_word_length]
self.add_word_to_dictionary(current_word)
current_word_length = 0
def add_word_to_dictionary(self, word):
# If the word is already in the dictionary we increment its count
if word in self.words_to_counts:
self.words_to_counts[word] += 1
# If a lowercase version is in the dictionary, we know our input word must be uppercase
# but we only include uppercase words if they're always uppercase
# so we just increment the lowercase version's count
elif word.lower() in self.words_to_counts:
self.words_to_counts[word.lower()] += 1
# If an uppercase version is in the dictionary, we know our input word must be lowercase.
# since we only include uppercase words if they're always uppercase, we add the
# lowercase version and give it the uppercase version's count
elif word.capitalize() in self.words_to_counts:
self.words_to_counts[word] = 1
self.words_to_counts[word] += self.words_to_counts[word.capitalize()]
del self.words_to_counts[word.capitalize()]
# Otherwise, the word is not in the dictionary at all, lowercase or uppercase
# so we add it to the dictionary
else:
self.words_to_counts[word] = 1
JavaScript Solution:
class WordCloudData {
constructor(inputString) {
this.wordsToCounts = new Map();
this.populateWordsToCounts(inputString);
}
populateWordsToCounts(inputString) {
// Iterates over each character in the input string, splitting
// words and passing them to this.addWordToMap()
let currentWordStartIndex = 0;
let currentWordLength = 0;
for (let i = 0; i < inputString.length; i++) {
const character = inputString.charAt(i);
// If we reached the end of the string we check if the last
// character is a letter and add the last word to our map
if (i === inputString.length - 1) {
if (this.isLetter(character)) {
currentWordLength += 1;
}
if (currentWordLength > 0) {
const word = inputString.slice(currentWordStartIndex,
currentWordStartIndex + currentWordLength);
this.addWordToMap(word);
}
// If we reach a space or emdash we know we're at the end of a word
// so we add it to our map and reset our current word
} else if (character === ' ' || character === '\u2014') {
if (currentWordLength > 0) {
const word = inputString.slice(currentWordStartIndex,
currentWordStartIndex + currentWordLength);
this.addWordToMap(word);
currentWordLength = 0;
}
} else if (character === '.') {
if (i < inputString.length - 1 && inputString.charAt(i + 1) === '.'){
if (currentWordLength > 0) {
const word = inputString.slice(currentWordStartIndex,
currentWordStartIndex + currentWordLength);
this.addWordToMap(word);
currentWordLength = 0;
}
}
} else if (this.isLetter(character) || character === '\'') {
if (currentWordLength === 0) {
currentWordStartIndex = i;
}
currentWordLength += 1;
} else if (character === '-') {
if (i > 0 && this.isLetter(inputString.charAt(i - 1)) &&
this.isLetter(inputString.charAt(i + 1))) {
currentWordLength += 1;
} else {
if (currentWordLength > 0) {
const word = inputString.slice(currentWordStartIndex,
currentWordStartIndex + currentWordLength);
this.addWordToMap(word);
currentWordLength = 0;
}
}
}
}
}
addWordToMap(word) {
let newCount;
// If the word is already in the map we increment its count
if (this.wordsToCounts.has(word)) {
newCount = this.wordsToCounts.get(word) + 1;
this.wordsToCounts.set(word, newCount);
// If a lowercase version is in the map, we know our input word must be uppercase
// but we only include uppercase words if they're always uppercase
// so we just increment the lowercase version's count
} else if (this.wordsToCounts.has(word.toLowerCase())) {
newCount = this.wordsToCounts.get(word.toLowerCase()) + 1;
this.wordsToCounts.set(word.toLowerCase(), newCount);
// If an uppercase version is in the map, we know our input word must be lowercase.
// since we only include uppercase words if they're always uppercase, we add the
// lowercase version and give it the uppercase version's count
} else if (this.wordsToCounts.has(this.capitalize(word))) {
newCount = this.wordsToCounts.get(this.capitalize(word)) + 1;
this.wordsToCounts.set(word, newCount);
this.wordsToCounts.delete(this.capitalize(word));
// Otherwise, the word is not in the map at all, lowercase or uppercase
// so we add it to the map
} else {
this.wordsToCounts.set(word, 1);
}
}
capitalize(word) {
return word.charAt(0).toUpperCase() + word.slice(1);
}
isLetter(character) {
return 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'.indexOf(character) >= 0;
}
}
Complexity
Runtime and memory cost are both O(n)O(n). This is the best we can do because we have to look at every character in the input string and we have to return a map of every unique word. We optimized to only make one pass over our input and have only one O(n)O(n) data structure.
What We Learned
To handle capitalized words, there were lots of heuristics and approaches we could have used, each with their own strengths and weaknesses. Open-ended questions like this can really separate good engineers from great engineers.
Leave a Reply