Tuesday, January 15, 2019

100 Years of Plane Crashes

Idea:

I discovered a great data set on Kaggle today that really hit close to home for me.  On the off chance that you haven't leisurely perused my bio section just yet, let me save you a few moments: I have a background as a Flight Engineer on the P-3 Orion aircraft, gained while serving in the US Navy for 8 years.

So imagine my surprise when I found data relating to aircraft- although it is a somber context, it was also really awesome to go back to my flying days for a moment.  During our training, we would learn about various types of aircraft crashes and what went wrong.  Then we would discuss what we would have done differently and the lessons learned from the crash.  This training was quite extensive and very important. You remember that pilot who landed the airplane in the Hudson River like a superhero? Hello, previous military training. (And maybe his 30+ years of aviation experience after the military helped, but it was mostly the military training.) In any case, it is no exaggeration to say that knowledge of previous aircraft crashes and their potential causes, effects, and corrections is a HUGE necessity in the world of aviation.

This data set was great to stumble upon given my previous career in flight, and I was eager to gain more insight into the history of aircraft crashes around the world.  My idea was to take the existing data set and build something that could give viewers insight about various aircraft crashes. I wanted this project to be something that could be used by any person or organization to view and potentially learn from others' mistakes.

This data viz is dedicated to those who have lost their lives in flight and their families.

Data Viz:

Insight:

This data visualization can be used to gain a lot of insight and information about aircraft crashes over the last 100 years. Let's explore the data below.
When looking at how many lives were lost compared to how many survived, the chart toward the bottom (Casualties by Year) shows peaks during two time periods with a significantly higher number of survivors than the average airplane crash. Why were there more survivors during these two periods? Unfortunately, the data doesn't give us any insight into the WHY of things. That is a problem best left for Google. Perhaps we can spend some spare time looking up the individual crashes and seeing what set them apart? Keep in mind, however, that this data spans crashes across the entire world, so we are talking about multiple flights for each recorded year. I think it would be a fair assessment to conclude that in 1998-2000, the pilots manning the controls were well-trained and amazingly prepared for disaster.  This time period had a HUGE number of survivors when compared to other time periods.  This data could be used within the aviation field to examine training methods, average pilot experience, particular aircraft safety, and more, in an attempt to understand and further strengthen the abilities of the pilots who managed to secure the safety of such a large number of passengers.

This chart also displays two more interesting tidbits of information.  First, the overall number of crashes starts to drop dramatically around the year 2000.  This is amazing- we are making progress in the long journey of air travel, making it faster, more efficient and safer than ever. The second thing we can determine from this chart is that more people are surviving plane crashes in general- especially when compared to the 1940s through the 1980s. In the '40s, many crashes can be attributed to World War II. In 1958, Pan American World Airways (Pan Am) launched Boeing 707 flights from New York to London, helping open up commercial trans-Atlantic jet travel. If Frank Abagnale, Jr. was able to walk onto a plane and fly it, no questions asked, then Pan Am's credential checks really weren't up to snuff and crashes were bound to happen. (If you've never heard of Frank Abagnale, Jr., you really need to brush up on your Leo DiCaprio movies- Catch Me If You Can is one of his best.) All kidding aside, it took a couple of decades for aviation experts to devise aircraft and training procedures that resulted in fewer crashes and more lives saved. This is great news! We first see a rise in crash survivors, then a trending drop in crashes themselves. Amazing!
The clickable circle on the left side shows all of the airplane crashes side by side. Each crash is displayed as a dot, colored from light to dark based on the number of casualties on the flight, while the size of the dot represents the total number of people on board.  The first thing we see when looking at this circle is that there are two large, deep-red dots that immediately stand out.  These dots signify a large number of people on board the aircraft as well as a high number of casualties when the airplane crashed.  In the circle, you can move your mouse over each dot to display a summary of the crash and what happened.  This is a great way to display a large amount of information in a limited space while still maintaining visual appeal.
The chart on the bottom right shows the running total of lives lost to aircraft crashes.  As the years progress, the casualty count stays very low until 1940, then climbs along a rapidly rising slope from 1940 onward. If you look closely, however, you can see that the slope starts to taper off at the top. This reiterates that aircraft casualties have started to decline significantly compared to previous years. Travel by flight is slowly becoming safer and safer as time goes on.
The chart on the top right is especially useful for frequent travelers. It breaks down the total casualties by organization. A viewer can use this chart to decide which airlines to avoid for their future trips.  One thing to consider when exploring this data is how long each company has been in the flight business. For example, some of the airlines on this chart are fairly new but already have a high count of lives lost to aircraft crashes. (Hint: these companies are at the bottom.)  The other interesting name on this chart is Pan Am.  They only lasted about 64 years in the commercial flight industry and claimed over 1,000 lives in the process.  (It makes you wonder about the real reason they are out of business...and no, Mr. Abagnale never crashed his flights.)
The last thing I want to mention about this data visualization is the ability to filter the data by Aircraft and/or Organization. You can use the drop-down menus at the top right corner of the chart to select these filters, and the charts will adjust accordingly. The image above shows the data after I applied a filter for the aircraft I was a part of during my time in the US Navy.  (Pop quiz: what aircraft did I fly in the US Navy? If you scroll up, that's cheating.) Selecting the P-3 (or Lockheed Orion, its non-military name) as the aircraft type produces the results shown above. You can see two (2) large crashes which claimed a large number of lives; the two peaks in 1968 and 1973 on the bottom left chart mark them.  If you ask a Navy FE what happened, they will not only tell you the story in detail over a few beers, but they will also tell you what they would have done to save the day. Flight engineers get a lot of the stress, very little of the decision making, and none of the credit. (But I'm biased, of course.)

Project:

This project idea came to me after coming across a great data set found on Kaggle.com from Sauro Grandi. (Thank you, Sauro, you are amazing.)  I saw a great chance to analyze various aircraft crashes and discover if there were any patterns or insight related to the casualties caused by aircraft crashes.

Tools:

Data:

The data can be found on Kaggle.com.  The version used is version 4, downloaded as a CSV.

Data Cleaning:

This data came with a large amount of information.  The only fields needed were the ones related to the casualties of each flight.  For example, this particular project required the number of casualties per crash, the number of survivors on board, the date of each crash, and the airline and aircraft involved. The existing data set also offers a location for each crash, which would need to be cleaned for geolocation but was not necessary for this project.  It could easily be cleaned at a later date to analyze crash locations if desired.
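Since this trimming is simple column selection, here is a rough pandas sketch of it (plus the CSV-to-Excel conversion described in the Process section below).  The file name and the column names (Date, Operator, Type, Aboard, Fatalities) are assumptions about the Kaggle download, so adjust them to match your copy:

import pandas as pd

# Keep only the fields needed for this project; the names are assumptions
# about the Kaggle CSV and may differ slightly in your download.
crashes = pd.read_csv("Airplane_Crashes_and_Fatalities_Since_1908.csv")
crashes = crashes[["Date", "Operator", "Type", "Aboard", "Fatalities"]].copy()

# Survivors can be derived where both counts are present.
crashes["Survivors"] = crashes["Aboard"] - crashes["Fatalities"]

# Save as an Excel workbook for Tableau Public (writing .xlsx requires openpyxl).
crashes.to_excel("Airplane_Crashes_and_Fatalities_Since_1908.xlsx", index=False)
print(crashes.head())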

Process:

The first thing I needed to do was import the data into Tableau Public.  This required opening the CSV file in Microsoft Excel and saving it as an Excel workbook, because the public version of Tableau does not import CSV files. (This is a feature of the Tableau paid version, however.)

Once the file was imported, I wanted to see the casualties over time using a line chart.  I added the total number of passengers aboard on the same axis of the chart, but in a different color.  The "aboard" total was placed behind the casualties graph to highlight years where there were more survivors (if any).  This is displayed with yellow (total passengers aboard) above red (fatalities) on the graph.

The second chart I created was the bubble chart (the large circle on the left), to show each plane crash in comparison to the others.  Because each bubble represents an aircraft crash, the circle was a perfect visual representation of this data.  I assigned a color to each flight based on the severity of casualties, and the size based on the total number of passengers. Large circles in light yellow signify more survivors, whereas large circles in deep red signify more lives lost in the crash.

To create the chart that displays the running total of lives lost by year (at the bottom right), I built an area line chart to give a visual representation of the running total of casualties over time.  The starting year for the chart is 1909 and the ending year is 2009.  A time span of 100 years makes for a clean and compelling visual.

For the next chart, it seemed important to know which organizations have had the most fatalities throughout the 100-year time frame.  I created a top ten chart based on the total number of lives lost in the crashes and color-coordinated it to match the rest of the display.

The last portion of this project was adding the filtering options at the top right of the display.  The filters used include the type of aircraft, the year, and the organization.  These filters help the user pinpoint any particularly relevant information they may want to see in the charts.

Please take a moment to remember those who have lost their lives to these tragic events in history.  They should be remembered and not forgotten.

Sunday, January 13, 2019

Percent of US Population Getting Minimum Wage or Lower.

Idea:

My main goal for this project was to see if I could illustrate the common problem of data misinterpretation. The idea for this data visual comes FRESH to you from the most recent Makeover Monday project- Week Three (3) of 2019.  The data visual in its original form is displayed with a blue monochrome color scheme, color-coded in different shades of blue based on the percentage of people being paid at or below minimum wage by state.  You can see the original data visualization at the very bottom of this post.

Because nothing says 'trolling' like politics, I thought it would be fun to break this data up by time (year) and presidential term.  Before anyone starts panicking, I did not include the current presidency in this data visualization- we can save that bickering for Facebook.

The data included information from 2002 up to 2017.  With this data, I was able to compare the Bush Administration to the Obama Administration in terms of wage levels earned by the American people.  The goal for this project is to display the information by state, year, and political party.  Also, just for fun, I have included a picture of each president before and after their 8-year presidency terms.

Data Viz:

Here is a link to the Tableau Portfolio 

Insight:

One of the most difficult things when analyzing research, including statistics and data evaluation, is examining and presenting the data without any personal bias. Here's the catch- data can be misread and misused without much effort, sometimes even by accident.  We see it all the time in the news- most recently with an interesting debate about whether coffee is good or bad for you, for example. (Regardless, I still enjoy about 7 cups of coffee per day...)

Using this chart, let me show you a great example of how data can be misinterpreted.  Let's hop into the DeLorean (did I just age myself?) and travel back to 2008, say around the end of the year-ish.  Do you remember what happened in late 2008?  BOOM!  The housing market collapsed.  Bush was working hard to stimulate the economy (and soften the blow) with subsidies, payouts, and stimulus checks, among other efforts. It was a rough year. Yeah, now you remember.  On top of all of that, Obama was about to take office in a few short months.  What better way to come into office than cleaning up someone else's party mess (pun intended)? Cue the mom from Mrs. Doubtfire, this is her jam.


The data here shows that the share of workers earning at or below minimum wage spikes during Obama's first two (2) years of presidency, then begins to trend back down.  What we can conclude from this data is that the effects of 2007 and 2008 put a massive strain on the economy, which we started to see in 2008 but really felt in 2009 and 2010. This shows what a huge impact the housing market has on the overall economy.  Not only were stocks plummeting, but employees' paychecks took a hit- a hard hit.  This is a classic case of economic cause and effect. While the Obama Administration was able to reverse the effects over the course of his presidency, it was a continual work in progress that took several years.

That being said, it is easy to look at this data without thinking too much about the circumstances behind it and see only that the first two years of the Obama Administration show the highest share of low-wage earners on the entire chart. Jumping to conclusions from the chart alone, it looks like the Obama Administration was a wage-wrecker.

The biggest insight for this data viz, then, is to remember what is happening in your data- what the data MEANS.  While a chart might show one thing at first glance, as viewers we need to understand the reasons behind each reading. This is why you will almost never see a data visualization without some sort of background information accompanying it- not in good data science, and not in good journalism. Even with background information, it can still be easy to misinterpret data. It is our responsibility as data visualizers to try to eliminate that as much as possible and remain unbiased at all times.

Project:

This was a data set for Makeover Monday on Data.world.  The project was to make over (and improve) a data viz from Business Insider.  The data set is about the percentage of people in the US who get paid the federal minimum wage or lower.

Tools:

Data:

The data set comes from Data.world.  However, the original data comes from the Bureau of Labor Statistics.

Data Cleaning:

The data was clean and ready for importing into Tableau.  The only change needed was converting the numbers from decimals into percentages.

Process:

The first thing I needed to do was make the small maps of the USA.  I combined the data from the 8 years of each presidency and filtered the map to show the average percentage of the population, by state, that made at or below minimum wage. For example, the state averages during the Bush Administration are displayed in the red map, and the state averages during the Obama Administration in the blue map.  I went with three shades of each color to represent the severity of each state, ranked light to dark from GOOD to AVERAGE to BAD, with the darkest color marking the highest (worst) share of low-wage earners. To keep things within the desired political theme, I used red and blue for the political party affiliation of each president. This would not have worked if both presidents had been from the same political party, so I lucked out with that one.
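For anyone curious how the per-state averaging and the GOOD/AVERAGE/BAD binning could be reproduced outside Tableau, here is a rough pandas sketch.  The file name, the State/Year/Percent column names, the year ranges, and the bin edges are all placeholders rather than the exact values used in the dashboard:

import pandas as pd

# Placeholder file: one row per state per year.
wages = pd.read_csv("minimum_wage_by_state.csv")

def classify(df):
    # Average share of workers at or below minimum wage per state,
    # then bucket into the three shades used on the maps.
    avg = df.groupby("State")["Percent"].mean()
    return pd.cut(avg, bins=[0, 3, 5, 100], labels=["GOOD", "AVERAGE", "BAD"])

bush = classify(wages[wages["Year"].between(2002, 2008)])
obama = classify(wages[wages["Year"].between(2009, 2016)])
print(bush.head())
print(obama.head())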

My second goal was to display the change in wages over time in yearly increments.  I decided to use a bar chart and balanced it out visually with vertical rows of stars along each side border, for a stars and stripes theme.  I kept the red and blue color scheme, corresponding to the years of each president.  This split the dashboard up into two (2) sides extremely well for an easy but aesthetically pleasing effect.

The last thing I needed to really set the visual over the top was a picture of each President.  I chose a picture of each president at the beginning of their presidency, and another picture of them at the end of their presidency. I used this to give a fun (but somewhat shocking) visual display of the rapid aging process that occurs when taking on a role as significant as the President of the United States. That is one job that will age you faster than any other.

Overall, it hopefully comes together in a fun and informational visual display, and the original data set has new significance when broken down into the terms of presidency, adding a little extra perspective to the original visualization.

Saturday, January 12, 2019

The Many Flavors of OREO

Idea:

Who doesn't love Oreos? (I mean, apparently some people...but Oreos are definitely one of America's most beloved snacks.) These self-proclaimed 'Wonderfilled' cookies have an obsession with creating a mind-blowing number of wild new flavors...I'm pretty sure I see new flavors every time my wife sends me to the store. I'm also pretty sure that my wife sends me to the store so I can stock up on junk food and she doesn't have to feel guilty about it. But I'm cool with that.

When I stumbled upon an Oreo taste-testing data set, my brain went instantly into midnight snack mode.  This particular data set is from the Famous Kaggle Data Scientist Mastermind, Dr. Rachael Tatman.  Rachael created a very simple survey of twelve (12) Oreo flavors, using 5 taste testers.  The data set would be perfect for illustrating how to read data and what one would look for when reviewing survey results.  Plus, it talks about 'America's Favorite Cookie,' the OREO! With a fresh glass of milk in hand, I embarked upon the goal of turning Rachael's awesome Oreo data set into a visual that would be almost as appealing as the cookies themselves.

Data Viz:

Here is a link to the Tableau Portfolio Page

Insight:

This was a great data set for explaining what one would look for to make a Tasteful decision.  No more standing in the cookie aisle, unable to decide whether those new Mint Oreos or Red Velvet Oreos would be better. The first image highlights the overall distribution of the data.  As you can see, the data has a left-skewed distribution.  This basically means that the tail of the data is on the left side, with the bulk of the scores sitting at the high end, closer to 5.  With only 5 taste testers, these cookies were approved by everyone who tried them. (In grocery shopping terms, these are the flavors the whole family will enjoy.)
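If you want to see what "left-skewed" means numerically, here is a tiny Python illustration using made-up scores on the same 1-to-5 scale (not the actual survey responses):

import pandas as pd

# Invented ratings, purely to show the mechanics: most scores are high,
# the tail stretches toward the low end, and the skew comes out negative.
scores = pd.Series([5, 5, 4, 5, 3, 5, 4, 5, 2, 5, 4, 5])
print(scores.skew())                        # negative value -> left-skewed
print(scores.value_counts().sort_index())   # most of the mass sits at 4 and 5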


Next, we have the bar chart of the average score for each flavor.  Since this data was on a small scale from 1 to 5, the average gives an overall picture of the score for each flavor.

This type of chart works great for visually displaying product review scores. The three white bars at the left are the lower scoring flavors.  The light blue horizontal bar is the average score of all the flavors.  The three flavors that score below the average could possibly lead to some interesting information in terms of product approval. Surprisingly, these testers did not like my personal fave, the MEGA STUFFED Oreos. It's sad, I know, but data doesn't lie. (Don't mind me while I drown my sorrows with another row of cookies.)

One thing to look for when creating data visuals is whether or not you have all the necessary data.  In this case, we can assume that data is missing because one of the taste testers only gave responses for 2 of the 12 flavors.  This could lead to skewed results. This is why it is important to pick a MEGA STUFFED sample of taste testers- most say 30 or more people- to get a more reliable score.  If one of the taste testers is unable to complete the entire survey, it would not dramatically alter the results in a pool of 30 people the way it does with only 5 testers.  (Whew, hang in there MEGA STUFFED, we still love you!)


The last chart helps us understand the cookie eaters in a visual context.  This chart shows the average score given by each person and compares it with the rest of the taste testers.  The dashed line marks the overall average score given by a cookie dunker.  As you can see, the two tasters on the left seem to give lower scores in general, so they have been labeled as HATERS of Oreos compared to the rest.  (This type of analysis can help you understand how each respondent feels overall, which is a handy tool for data evaluations.)  Since 2 out of 5 give lower scores, we could say that we have more Oreo LOVERS in the overall sample. I hate to keep bringing this up, but these two low-score givers must have also given low scores to the MEGA STUFFED OREOS.  I chalk that up to not having milk. There's simply no other explanation.


Overall this type of analysis is great for finding out what the responses are to a particular survey.  This is an excellent way to view the responses in a grand-scale context and make a better-informed decision for future products.  In this case, Oreo could use this data to see which flavors were a slam-dunk, and which ones crumbled under the pressure. (See what I did there? I'm so punny.)

In this particular data, we could also investigate the three lower scoring cookie flavors and ask why they scored lower.  Maybe, for integrity's sake, we should compile a group of willing taste testers to give these flavors a second chance. But please, make sure you have a jug of fresh milk with you, because this could turn into a long trip down the rabbit hole.

Project:

This project was created to help explain how to visually display a survey-based dataset.  Surveys are a widely used way of gathering data, and they help provide insight for making informed decisions about what is and is not working.  This project walks through how to analyze this type of data to provide extremely useful insight.

Tools:


  • Tableau Public

Data:

This data set was discovered on Kaggle.com and created by the Famous Dr. Tatman (Kaggle Data Scientist Extraordinaire).  It is survey-style data of the kind very commonly used to gather information.  This set caught my attention simply because it was discussing OREOS- and Oreos are simply awesome and delicious. Therefore, this had to be done.


Data Cleaning:

The data set had missing data, displayed as NULL in various fields, which needed to be handled. This was easily fixed in Tableau by filtering the data.

Process:

This data viz was super simple and perfect for those looking for quick insight into a particular survey.  The data was imported into Tableau.  I first selected the response columns, then right-clicked the selection and chose the "Pivot" option.  This stacks the data perfectly for creating a visualization in Tableau.
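For reference, the same wide-to-long reshape that Tableau's Pivot option performs can be sketched in pandas with melt.  The file name and column headers here are placeholders, not the exact ones in Rachael's survey file:

import pandas as pd

# Placeholder file: one row per taste tester, one column per flavor.
oreo = pd.read_csv("oreo_survey.csv")
long_scores = oreo.melt(id_vars="Taster",
                        var_name="Pivot Field Names",     # the flavor
                        value_name="Pivot Field Values")  # the 1-to-5 score
print(long_scores.head())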

With the first chart, we needed to figure out the distribution shape of the data. I did this by selecting the "Pivot Field Values" for the values and then choosing the histogram option for the graph.  Doing this, the 'Milk Dunk' histogram is made automatically.  This one came out with a left-skewed shape.

The second chart is a bar chart displaying the average score for each flavor.  The average was used because the survey provided this data on a scale from one (1) to five (5).  The "Pivot Field Names" and "Pivot Field Values" were placed on the worksheet to represent the flavor and score.  More specifically, the "Pivot Field Values" were averaged to get an overall picture of the scores from all five (5) tasters.  I added a horizontal line (using the Analytics tab) to display the overall average score. This helps the viewer easily see which scores are lower and which are higher compared to the average.

The last chart was implemented to help understand what type of cookie eaters tried the Oreos.  Simply put, are these taste testers positive or negative in their overall responses? (Are they Oreo lovers, or Oreo haters?) Honestly, I must say these taste testers do not seem to be very positive, as evidenced by the horribly disgraceful rating they gave the MEGA STUFFED OREO!  (Hang on, I need a minute, it still hurts.) Okay...back to work. I displayed this last chart as a plot chart, though it could also have been created as a bar chart.  For visual purposes, I felt a plot chart provided a better balance for the dashboard's overall appeal.  The same concept was applied here as with the previous bar chart (for the flavor scores above).  The plot chart helps identify what type of people are taking the survey by viewing each taste tester's ratings as an averaged score.  I divided this score into two categories based on whether a tester's average was higher or lower than the overall average, giving us two people who "HATE" Oreos and three who "LOVE" Oreos.
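Here is a rough pandas sketch of that LOVE/HATE split, using a tiny invented long-format table (Taster, Flavor, Score) rather than the real survey responses:

import pandas as pd

# Invented data, purely to show the classification mechanics.
long_scores = pd.DataFrame({
    "Taster": ["A", "A", "B", "B", "C", "C"],
    "Flavor": ["Original", "Mega Stuffed"] * 3,
    "Score":  [5, 4, 2, 1, 4, 5],
})

per_taster = long_scores.groupby("Taster")["Score"].mean()
overall = long_scores["Score"].mean()
labels = per_taster.apply(lambda s: "LOVES Oreos" if s >= overall else "HATES Oreos")
print(labels)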

You can use these same principles with any survey-based data set to get a great visual of the information provided.

Friday, January 11, 2019

Etsy Shop Keyword Tag Scraping With Python

Idea:

This project stemmed from the idea of helping my wife with her Etsy shop. She's been running her shop on Etsy for a couple of years now, and to make her life a little bit easier I wanted to use my data skills to help her do some product research. Happy wife, happy life.

My main goal was to try and generate a CSV file with all the keyword tags used in each of the top fifty (50) products from a specific search query on Etsy.  I would then be able to use this data to discover what keywords other sellers are using, in hopes of improving Etsy product listings by adding the most popular keywords to existing listings (for example, on my wife's products). Theoretically, this would improve the odds of those listings appearing in relevant Etsy searches.

**Sidenote: There are now two (2) organizations that have built great online platforms around this same idea.  Both websites provide this data as well as more insight than what is available from a simple scrape.  Here are both links: marmalead.com and etsyrank.com**

Data Viz:



Insight:

The best insight for this project was finding keywords to use in a listing.  When reviewing the word cloud, I want to pay close attention to the little words just as much as the larger words.  The little words might provide great long-tail keyword tags that I may be missing.

The other thing we are looking for, which might impact our listing, is the different spelling variations or phrases used in the product listings.  Finding these kinds of data nuggets can help build strong Etsy SEO, since some search queries use spellings that differ from the correct one.


I would also stack the words by the number of times they show up in the column.  This gives an exact count of how many times each word appeared across the listings, which points to the must-have keyword tags and to words that are possibly being underutilized.  (A quick Python version of this count appears after the table below.)


Keyword_Tag | COUNTA of Keyword_Tag
Clothing | 98
Tops & Tees | 94
Unisex Adult Clothing | 84
T-shirts | 84
dance shirt | 70
gift for dancer | 50
dancer shirt | 50
dancing shirt | 46
dance | 42
dancer gift | 38
*These are the top ten keyword tags used in the example
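If you would rather get this count straight from the scraped CSV instead of using COUNTA in Google Sheets, a few lines of Python will do it.  The file name matches the query used later in this post ("dance+shirt.csv"); adjust it to whatever your run produced.

Code:
import pandas as pd
from collections import Counter

# Tally how many times each keyword tag appears across the scraped listings.
df = pd.read_csv("dance+shirt.csv")
counts = Counter(df["Keyword_Tag"])
for tag, n in counts.most_common(10):
    print(tag, n)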

This same style of analysis could also be used on the product titles.  The titles can be analyzed to find which words perform best and where they tend to appear in the listings.  This requires a bit more statistics and text processing to account for where each word sits in the title, but it can be done.
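As a starting point, the same counting idea works on the Title column the scraper writes out.  This sketch just splits titles on whitespace, which is a simplification, and it drops duplicate titles first because the output CSV repeats each listing once per keyword tag.

Code:
import pandas as pd
from collections import Counter

# Count words across product titles; duplicates are dropped because the
# scraper writes one row per keyword tag for every listing.
df = pd.read_csv("dance+shirt.csv")
words = Counter()
for title in df["Title"].drop_duplicates():
    words.update(w.lower().strip(".,!") for w in str(title).split())
print(words.most_common(15))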

You can also perform a cost analysis of the products being sold.  This would help you decide what price point you might want to set for your product.

And lastly, you may be wondering why we needed the seller name.  Well, let me tell you, it provides great insight.  It shows you how many times a particular seller appears in a given keyword search; a seller who shows up often is a bigger competitor in terms of ranking, and it would be worth investigating that shop further to help improve your own product listing on Etsy.

Project:

Etsy is an online marketplace where a seller who makes things (such as crafts) can sell their goods to customers.  Etsy is similar to Amazon and eBay but targets the sale of handmade goods and craft supplies.  Their platform allows for thirteen (13) keywords for each product listing, which they call keyword tags.  These keyword tags are what help get the product noticed in the search algorithm.  Each of these keyword tags is displayed on the product page near the bottom.

This project is going to scrape those keyword tags.

Tools:

  • Anaconda
  • Python 3.7
  • Jupyter Notebook
  • Pandas
  • Beautiful Soup
  • Google Sheets

Data:

The data is from Etsy.com in an unstructured format.  This scrape will be used to build the data set.

Here is the example dataset generated in Google Sheets.

Data Cleaning:

The data cleaning was completed in Google Sheets to help build a quick word cloud.

The data we will be gathering in this blog post comes from the Etsy search term "Dance Shirts."  Here is what the CSV file looked like when I ran the code.

Process:

The coding will be written in Python on a Jupyter Notebook.

In order to scrape the site, I need the Beautiful Soup and Pandas packages installed for Python.  I am using Anaconda, which already provides these packages.

Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Next is to fetch the data from Etsy.com.  I need the URL of the page I am going to scrape and a look at the site's HTML.  The HTML is broken into tags, and those tags are what separate the elements I need to extract.  What we are looking at is the results page from a search query on Etsy; from that page we learn the structure, which gives our crawler a list of products.

Code:
# this is the keyword search term for etsy
query = "dance+shirt"
# request the search results page and parse it
etsy_page_search = requests.get("https://www.etsy.com/search/?q=" + query)
soup_search = BeautifulSoup(etsy_page_search.content, "html5lib")

The listing data needs to go into a list of product URLs, which the next part of the crawl will use.

Code:
# every anchor tag on the search results page
listing_id = soup_search.find_all("a")
# this holds the listing URLs
list_id_records = []
keywords_records = []

# build each listing's URL from its data-listing-id attribute
for listing in listing_id:
    list_id = listing.get("data-listing-id")
    if list_id is not None:
        url_product = "http://www.etsy.com/listing/" + str(list_id) + "/"
        list_id_records.append(url_product)

After building the code to get the 50 products on the page, we need to open each product page and have the crawler scrape it.  The main data we are looking to grab from each page is as follows:
  • Title of the product
  • Name of the Seller
  • The sale price of the product
  • The keyword tags
Code:

# getting product page information
for list_id in list_id_records:
    etsy_page_product = requests.get(list_id)
    soup_product = BeautifulSoup(etsy_page_product.content, "html.parser")
    # every keyword tag button on the product page
    keywords_list = soup_product.find_all("a", {"class": "text-center btn btn-link tag-button-link"})
    for keywords in keywords_list:
        keyword = keywords.text
        title = soup_product.find("h1", {"class": "mb-xs-2 override-listing-title break-word"}).text
        seller = soup_product.find("span", {"itemprop": "title"}).text
        price = soup_product.find("span", {"class": "currency-value"}).text
        # one record per keyword, repeating the listing details each time
        keywords_records.append((title, seller, price, keyword))

If each listing's thirteen (13) keywords were kept in one row, we would end up with thirteen columns of keywords.  What we want instead is all of the keywords in one column.  By placing them in one column, we have to repeat the rest of the listing information for each keyword.  Guess how many times? You guessed it- thirteen (13) times.

The reason for this madness is to make it easier to build the word cloud picture in Google Sheets and get a simple count of the words.

After we gather all this information, we need to store it in a CSV file.  I used pandas to build the data frame and write it out as a CSV.  The CSV file generated by the code will then be used in Google Sheets.

df = pd.DataFrame(keywords_records, columns =["Title","Seller","Current_Price","Keyword_Tag"])
df.to_csv(query + ".csv", index=True, encoding="utf-8")
len(keywords_records)

Inside Google Sheets, we open our CSV as a Google Sheets file.  This gives us a spreadsheet with all the information we scraped.  We then take the keyword column and use it to build the word cloud.  The larger the word, the more often it is used.

Wednesday, January 9, 2019

The Strongest Comicbook Star

Idea:

What better way to get ready for a year full of POW, KAABOOM and SHAZZZAM than with a superhero data viz!  This idea came from a data set I found on kaggle.com.  After searching all over the globe (ok, mainly the internet), this data set just jumped out at me- like a superhero kick to the geeky side of the head.  I should do a data visual on comic book characters and their stats!  (It's the small things that make a comic-book data nerd excited, I tell ya.)  The idea is to compare the stats of various superheroes, pairing up one character against another in a battle of the stats.  Who wants to place a bet...will it be YOUR favorite superhero who rules the data today?

Data Viz:


Insight:

First, let's consider some of the things that make a hero, well, SUPER. We can use data to evaluate each superhero by strength, speed, durability, intelligence, combat skill and overall power.
To get started, let's take a look at Batman vs. Iron Man- which hero is stronger based on these compiled hero skills?

Any comic fan knows these two popular heroes (each from different superhero worlds, no less), but what if they had to go up against each other?! Let's find out! It's the battle of rich men with cool gadgets!  Both of these superheroes have money, and both have super cool toys to fight crime with. But who would win in a duel?

When comparing Batman vs. Iron Man, it becomes pretty clear that while their bank accounts may be similarly stacked, Iron Man has a far wider variety of skills. Let's hope Alfred stocked up on the first aid kit. Don't believe me? Well, let's take a look at the chart:


Our next match up is all about the ladies...  Sorry, boys, but these girls got some POWER! Sticking with the trend of picking heroes from two different comic universes, let's see what Wonder Woman and She Hulk bring to the table.  Let the stiletto-wearing warrior battle begin!


Well... it is safe to say all that hand-to-hand combat training and extra study hours have paid off for Wonder Woman. No wonder she brings so many boys to the yard.

In honor of the hilarious friend-feud between Ryan Reynolds and Hugh Jackman, let's match up Deadpool and Wolverine.  This battle ends in a close tie, so it's a good thing they get along (sort of).  The only real edge Deadpool has is with a higher power stat.  But considering both heroes can regenerate, this battle would definitely be a 10-rounder.


Don't be shy...use the table to match up YOUR favorite heroes and villains to see who would win in a match.  Who knows, you might be KAPOWED! by the results.

Project:

The project was to use the existing data set found on kaggle.com to make something fun and different.  The idea of using comic book characters and comparing their individual stats brings back the good old days of trading cards on the playground.  The best thing about those old school trading cards was the stats.  This project, then, is like virtual baseball cards for nerds. (Or would it be like Pokemon cards for cool people?)

Tools:

  • Tableau Public
  • Adobe Photoshop

Data:

The data set came from Kaggle user Daniel R (Thank you, you're awesome).  Daniel R has a large selection of data perfect for this type of analysis.  The data set chosen specifically for this project was the character stats CSV file:

Data Cleaning:

This set came with everything needed to start an analysis.  The only cleanup required was correcting the empty, or "null", fields.  This can easily be done in Tableau.

Process:

The process begins with downloading the CSV file and importing it into Tableau.  (The version of Tableau used was Tableau Public 2018.3.)  Tableau does not play well with the structure of this dataset, so it needed to be pivoted to make it easier to read.  This was done by selecting the skill columns, right-clicking, and choosing the Pivot option, which moves the data into a more manageable structure and makes generating the radar chart easier.

The first order of business was to create a spider chart to display the stats of each character.  Creating the X-axis part of the circular design requires a few "Calculated Field..." entries.  The X stats formula used:

CASE [Stats Name]
WHEN 'Combat' then 0
WHEN 'Durability' then [Stats Value] *SIN(2*PI()/6)
WHEN 'Intelligence' then [Stats Value] *SIN(2*2*PI()/6)
WHEN 'Power' then [Stats Value] *SIN(3*2*PI()/6)
WHEN 'Speed' then [Stats Value] *SIN(4*2*PI()/6)
WHEN 'Strength' then [Stats Value] *SIN(5*2*PI()/6)
END 

The X stats axis is followed up with the Y stats:

CASE [Stats Name]
WHEN 'Combat' then [Stats Value]*COS(0)
WHEN 'Durability' then [Stats Value] *COS(2*PI()/6)
WHEN 'Intelligence' then [Stats Value]*COS(2*2*PI()/6)
WHEN 'Power' then [Stats Value] *COS(3*2*PI()/6)
WHEN 'Speed' then [Stats Value] *COS(4*2*PI()/6)
WHEN 'Strength' then [Stats Value]*COS(5*2*PI()/6)
END
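As a sanity check on the two calculated fields above, here is the same polar-to-Cartesian math written out in Python.  Each of the six stats sits at an angle of k * 2 * PI / 6 around the circle, scaled by its value:

import math

# Position of each stat around the circle, matching the order in the
# Tableau CASE statements above.
ANGLE_INDEX = {"Combat": 0, "Durability": 1, "Intelligence": 2,
               "Power": 3, "Speed": 4, "Strength": 5}

def radar_point(stat_name, stat_value):
    angle = ANGLE_INDEX[stat_name] * 2 * math.pi / 6
    return stat_value * math.sin(angle), stat_value * math.cos(angle)

print(radar_point("Combat", 80))      # (0.0, 80.0): straight up, matching the 0 in the X formula
print(radar_point("Durability", 50))  # roughly (43.3, 25.0)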

The next part was to add the pills to the worksheet, which creates the layout of small circles.  To make the area chart, change the mark type to "Polygon" and use Stats Name to set the point order.  This creates the odd-looking shape that resembles comic-style "KAPOW!" word art.  To label the various points, the X-axis was added to the worksheet again as a dual axis, showing the small shapes along with the labels for each skill.

A background (created in Photoshop) was inserted behind the chart, with rings at intervals of 10.  For added flair, the comic word-art frame was added behind the circle.  Sticking with this theme, a second sheet was created to provide the same style of chart for the "vs" character.
The word art frame 

The final background for the chart

Lastly, labels were added to the dashboard explaining each character's allegiance, name, and overall rating.  The overall rating is the average of all six (6) skill scores.  The data set came with a cumulative total, but the average provides a better-rounded picture of the stats for the purposes of this data viz.

The final trick was to scope the filters so that one master filter does not change all the charts.  This is done by assigning each filter to its own worksheet with the filter menu's apply-to-worksheet option.  Once the filters are assigned to the correct charts, the matchup can begin!

Tuesday, January 8, 2019

Making a Picture with Christmas Spending Data

Idea:

Because I was really feeling the Christmas spirit this year, I wanted to make a data viz that looks Christmas-y and fun.  The compiled data spans several years and uses the average dollar amount spent on holiday gifts to create a festive visual picture.  The idea is to analyze the year-over-year percentage change alongside the amount spent on holiday shopping.  Using the time series data as a bar chart graphic, I am going to display the data as a cityscape with Santa riding through the night sky. You're feeling the Christmas spirit already, I know it.

The bar chart (representing the years) looks like the buildings of a city backdrop, and the percentage of change (in holiday shopping) represents the stars of the night sky.  The final touch will be adding Santa Claus riding off in the night sky.

Data Viz:



Here is a link to my Tableau Public page.

Insight:

It was especially interesting to make this visual display and be able to see the impact that the 2008 housing crisis had on the average spending on the holiday season.  There was almost a 30% drop in spending.  It amazes me how something like a mortgage industry crash can affect all other aspects of economics, including the amount spent on gifts bought for that Christmas season.


The other informative insight can be found when looking at the pattern over the various years.  Approximately every eight (8) years the amount of spending drops, then it gradually climbs back up again. It is interesting, but it is also encouraging because it shows a pattern that all things are temporary. Good years and bad years tend to cycle through each other, teaching us that we will always have good times to look forward to in the future, even when it might seem otherwise.

Project:

This project was created using a data set from Makeover Monday on Data.world.  The task was to make over (and improve) a data viz from Statista.com.  For this project, I wanted to make a simple Christmas scene with the data.

Tools:

Data:

The data has two variables: the year and the average amount spent per person on holiday shopping.  The data was taken from Statista.  The average amount is in US Dollars (USD).

Data Cleaning:

The data was already really clean.  The only thing needed was to make a calculation in Tableau: the percentage change from the prior year, which can be positive or negative.  This separation will help decide the color of the stars.
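Here is a rough pandas sketch of that year-over-year change; the Tableau version of the same calculation appears in the Process section below.  The file name and the Year / Christmas Spend In Us column names are placeholders:

import pandas as pd

# Placeholder file with one row per year.
spend = pd.read_csv("christmas_spending.csv").sort_values("Year")

# (this year - last year) / last year, the same idea as the LOOKUP-based
# Tableau table calculation shown in the Process section.
spend["Pct Change"] = spend["Christmas Spend In Us"].pct_change()
print(spend[["Year", "Pct Change"]].tail())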

Process:

The data was imported from data.world into Tableau Public.  Next, I needed to get the year-over-year percentage change.  This was completed by making a new calculation: in the Analysis menu, under "Create Calculated Field...", the following formula was entered:

(ZN(SUM([Christmas Spend In Us])) - LOOKUP(ZN(SUM([Christmas Spend In Us])), -1)) / ABS(LOOKUP(ZN(SUM([Christmas Spend In Us])), -1))

After this, I needed to make two charts: the bar chart and the plot chart.  The bar chart of the cityscape initially displayed really high bars, so I edited the axis to fix the range of the data, from 550 at the low end to 900 at the high end.  This shrinks the bars down to a reasonable city-building height.  For color, a spectrum from yellow to blue, based on the difference from the average, provided the color variation in the buildings.

On to the stars.  The stars are made from the percentage difference in the amount spent from the prior year, which provides the different star locations.  The shape used was the standard stock star shape in Tableau.  For color, I split the values in two: anything white is a positive change and anything yellow is a negative change.

Now we have the cityscape and the night sky; the last step is adding the flying Santa.  This was found with a simple Google search for a PNG.  This was the image used:

The last-minute add-in was the slider for the years.  This allows the fun cityscape to change based on the selected years.  It was set directly in the middle of the image for ease of use.

Saturday, January 5, 2019

All the Gifts From the 12 Days of Christmas

Idea:

You've probably heard that famous holiday song where some lucky lover is given a gift every day for twelve (12) days of Christmas. First of all, what does this person DO for a living to afford all these gifts?! (Asking for a friend.)

The idea for this project was to make some sort of data visualization showing all of the gifts given to the "true love" over the entire 12 Days of Christmas.  It starts out with a partridge in a pear tree...but every day they get that SAME gift with a new one added for the new day!  Ever wonder how many gifts they get in total?  This data visualization is going to answer that question.  You can not only learn the total number of gifts received but how many on each day and how many of each gift. (Sorry, it does NOT explain what career field you need to pursue similar gift-giving legend status.)

Data Viz:

Insight:

The goal of this project was to take some (very) OLD information and find something new and interesting in the data.  The most captivating thing (for my nerd-enhanced brain, at least) is how the numbers create unique patterns throughout the data visualization, making for a pretty cool visual effect.

The total quantity of each gift pairs up with the gift from the opposite day.  For example, the Day 1 and Day 12 gifts total 12 each, while the Day 6 and Day 7 gifts total 42 each.  This pattern continues for all the days, creating a histogram-style display.  Looking at the chart, the gifts follow a neat mathematical pattern that explains the totals: each day's gift total increases (or decreases) in even steps only.


The second thing that caught my eye does not really register in a person's mind when listening to the song, but REALLY stands out in a visual context.  The star chart on the left side displays the sequential order of new gifts each day.  Depending on how you interpret the song, it sounds like our billionaire gift-giver adds new presents every day IN ADDITION to what was already given. For example on Day #2, they gave two turtle doves and a partridge in a pear tree- giving a total gift count of three (3). This means that by Day #12, the "true love" gets to open 78 gifts.  78 GIFTS!   (Please, for the love of humanity, don't let anyone show this chart to my wife.)


The last piece of insight shown on the chart is the grand total of gifts.  After the twelve (12) days, the "true love" has a total of 364 gifts in all.  More than 50% of the gifts are given on the last three (3) days of Christmas.  What does someone do with so many gifts?  (Can they maybe trade some of those Geese-A-Laying or Swans-A-Swimming for some Horse-Power-Under-The-Hood?)
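For anyone who wants to double-check the arithmetic, a few lines of Python reproduce the pair pattern and the totals quoted above:

# The gift introduced on day n is received on each of the remaining
# (13 - n) days, n at a time, so its grand total is n * (13 - n).
gift_totals = {day: day * (13 - day) for day in range(1, 13)}
print(gift_totals)        # day 1 and day 12 both come to 12; day 6 and day 7 both come to 42

# Gifts opened on each day (1 + 2 + ... + day), then the grand total.
per_day = [sum(range(1, day + 1)) for day in range(1, 13)]
print(per_day[-1])                        # 78 gifts opened on day 12
print(sum(per_day))                       # 364 gifts across all twelve days
print(sum(per_day[-3:]) / sum(per_day))   # just over half arrive in the last three days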


Project:

This project was inspired by the Christmas season, with Christmas cheer! (Special thanks go to my wife's endless rotation of Christmas music 28 hours a day starting in November.)  Seriously though, what better way to end the last day of Christmas than with a data viz- something that says "Merry Christmas!"


Tools:

Data:

Surprisingly, there was no data set regarding this song... UNTIL NOW!  A new data set was created in Google Sheets and shared with the world.

(Google Sheet) (Microsoft Excel) (CSV) (Kaggle) (Data.world)

The data set has three columns: Day number, Gift, and Quantity.

Data Cleaning:

Since the data set was freshly created, not much cleaning was needed.  When building the data set, everything was entered specifically for this data viz project.  Each gift has the day number next to it in a separate record.  For example, day three (3) has three (3) gifts given.  Each of these gifts gets a separate row, which gives a total of 3 rows just for that day.

Process:

The data visual was simple in concept.  Four (4) things needed to be emphasized in this data viz.  The first was to explain the exact dates of the twelve (12) days of Christmas.  This was done by creating a simple graphic of a calendar highlighting the twelve (12) days.  The reference used in this data viz was the Catholic liturgical Christmas season, which historically has twelve (12) days starting on Christmas Day (December 25th) and ending on January 5th.  To explain this in the data viz, an extra text area was created, providing more depth about the twelve (12) days of Christmas.

The second highlight of the data was the grand total of gifts.  Instead of making a chart, the best way to illustrate this data was simply to provide the grand total in numerical form.  This number was placed right under the calendar graphic to help balance the visual.

The third part of the project was to show how many of each gift were given.  The chart started as a bar chart.  After noticing the unique pattern of pairs, it was necessary to include this characteristic in the display.  The bars were sorted in the chronological order of the gifts from Day #1, and the final image is a bar chart that resembles a histogram.  To emphasize the pairs, each matching quantity was displayed in the same color.  Lastly, the name of each gift and the quantity received were placed next to the bars to explain the chart.

The last piece of information added to the display was the total number of gifts given each day.  This was the final piece of information needed to really explain the data of the song in visual form.  Originally, a bar chart was going to be used but it did not balance with the other bar chart in an aesthetically pleasing way.  A line chart was tried next, but it did not separate the day intervals as desired.  The final idea was to use a scatter plot, which provided an overall pleasing look for the information needed.  Additionally, stars were used for the points of the scatter plot, to create that "Christmas" feeling.  This worked out perfectly because it highlights the curve of growth while also separating the days.
