Thursday, January 24, 2019

10 Years of Marvel Comic Movie Magic - Part 2 (How Much Was James Gunn Worth to the MCU?)

Idea:


Welcome to the second part of the MCU analysis project.  I know all you awesome readers have been dying with anticipation to see what else I could possibly say about the Marvel film franchise...but never fear, I am here. (Puffs chest out, waits for the wind to blow cape dramatically behind me...waits a little bit longer...)

Well anyway, let's get down to the data. I know some folks are a little upset about the firing of Guardians of the Galaxy director James Gunn.  I am not here to debate whether I personally agree with the decision or not (you know how I like to save the theatrics for social media, I just can't resist those popcorn memes). BUT- I thought it would be fun to take a look at HOW MUCH he was actually worth to Disney's MCU franchise.  Show me the money!!

For those who have not heard the dramatic news, James Gunn was the director of BOTH Guardians of the Galaxy (GOTG) and Guardians of the Galaxy Vol. 2.  Let's take a quick jump back to the previous analysis (Part 1) to grab some helpful data. The GOTG franchise so far has pulled in $1.63 billion at the box office and returned more than double every dollar spent. Unfortunately for this amazing series, however, James Gunn was let go by Disney in mid-2018 due to an unexpected scandal. This means that he will NOT be returning to direct Guardians of the Galaxy Vol. 3.  (But don't fret, Gunn was asked to take over the DC Suicide Squad films, and we ALL know DC could use the help.) The future of GOTG3 just got a little bit shakier, however. So let's find out how much cold, hard cash Disney is potentially risking on this decision... (Was it worth it? Let me work it. Put my thang down, flip it and reverse it.)

Data Viz:




Here is the link to the Tableau Public portfolio

Insight: 


I will try my hardest to keep my opinion out of this analysis, so y'all can have some bias-free insight here, but if you want to share YOUR opinion on the matter, please feel free to leave a comment below. (I will be sure to pay tribute with my most awesome popcorn memes.) To begin this analysis, let's first discover if James Gunn even HAD any tangible value to the MCU. When comparing James Gunn to the other Marvel directors, Gunn has (on average) maintained approximately the same film budget as what Joe Johnston spent to direct Captain America: The First Avenger.  Now, remember: Captain America was a hard-core profit flop. James Gunn, however, was able to make more than three (3) times the profit per movie that Joe Johnston did. 

While that may already sound impressive, James Gunn also holds the second highest average return, with an approximate 146.81% profit on his movies. (Joss Whedon holds the highest profit return, with an average of 150.27% for the Avengers films.) James Gunn is one (1) of only four (4) Marvel directors who (on average) made over double what was spent on production per movie, which is pretty astonishing considering there have been 13 directors in the Marvel franchise over the last 10 years.


When looking at what James Gunn contributed to the MCU as a whole, he accounts for 9.34% of Marvel's box office sales over the ten (10) years.  With just two (2) movies, James Gunn ranks third in box office sales, behind only Joss Whedon's two films (16.68%) and the Russo Brothers' three films (22.43%).  These three (3) directors ALONE account for almost 50% of MCU box office sales.


Based on these stats, it is safe to say that James Gunn was an extremely valuable asset to the MCU.  James Gunn contributed about 10% of the MCU's sales and was able to generate a return (on average) of just under 150% per movie. So to answer our previous question: YES, James Gunn was a strong contributor to Marvel's success. Since we have now confirmed that James Gunn was an asset to the MCU, let's see if we can determine the possible loss Marvel took from letting him go. 

To find this out, we are going to use a simple linear equation. Okay, now what did you learn in high school math? Start writing down the most relevant formula...just kidding! One of the great features of Tableau is that it provides some basic analysis tools, including this handy-dandy formula (Y = mX + b) for predicting the next film.

Tableau uses this formula with two points: the data from the first GOTG film (the first point) and the data from the second GOTG film (the second point). Tableau then fits a line through the points (its steepness is the slope, m) that can be used to predict future profitability for the third movie. The top two points in the chart below (shown in purple) represent the box office sales for each film. The bottom two points (shown in green) represent the budget for each film. Now we can determine the potential success of the third movie using both of these metrics (box office sales and budget). Score!

We can also examine this data and see that the trend (that sloping line) shows an increase in box office sales (purple line) and a decrease in the budget (green line).  From a financial view, costs going down and sales going up is ALWAYS good business.  These trends help us predict what the box office might have been for movie #3 had James Gunn directed it. By using these two (2) equations (the purple and green lines), and assuming Guardians of the Galaxy Vol. 3 was still released in 2020 as initially planned, we can predict (an educated guess, but not necessarily a strong one) what the box office sales and budget would have been if no other factors changed. For example, if James Gunn continued as director, the film would have progressed as planned with no unforeseen delays, no need to hire a new director, etc.


These formulas, then, provide a monetary value of James Gunn as GOTG3 director, and from that, we can infer the financial consequences of his removal. For math nerds who want to dig deeper into what determines those awesome green and purple points on the graph, let's break this down into numbers:
  • Y = mX + b
  • Box Office Revenues = 3.01667e+07 * (Year of Release Date) - 5.99824e+10
  • Budget of Movie = -1.07667e+07 * (Year of Release Date) + 2.19164e+10
We already have the X and Y values for the first two movies, so Tableau calculates the values of m and b for each equation (box office revenues and budget). Using those values, we can then solve each equation (finding Y) for the third movie by plugging in the calculated m and b and the planned release year of 2020 for X. (Thanks, Tableau!)
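
To make that concrete, here is a minimal Python sketch (not part of the original Tableau workbook) that plugs the planned 2020 release year into the two trend-line equations listed above:

def box_office(year):
    # Box Office Revenues = 3.01667e+07 * year - 5.99824e+10
    return 3.01667e7 * year - 5.99824e10

def budget(year):
    # Budget of Movie = -1.07667e+07 * year + 2.19164e+10
    return -1.07667e7 * year + 2.19164e10

year = 2020  # originally planned GOTG Vol. 3 release year
sales = box_office(year)
cost = budget(year)

print(f"Predicted box office: ${sales:,.0f}")        # ~$954,334,000
print(f"Predicted budget:     ${cost:,.0f}")         # ~$167,666,000
print(f"Predicted profit:     ${sales - cost:,.0f}")  # ~$786,668,000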

Using this simple model, we can predict that the box office sales for Guardians of the Galaxy Vol. 3 would have been $954,334,000, with a budget of $167,666,000.  That would make the profit $786,668,000 (an 82.43% profit margin).  With all other factors remaining the same, James Gunn would have contributed an additional 4.68% to Marvel's total box office sales with GOTG3 alone. James Gunn's total contribution (all three (3) movies) would then make up 14.02% of MCU box office sales.

So now we have the answer to our second question: the firing of James Gunn jeopardizes a potential $954 million in revenue for Disney's Marvel franchise.  Since Vol. 3 is indefinitely suspended, one could say the monetary cost of losing James Gunn is almost a billion dollars.  And because Gunn was fired on 'moral grounds,' you could say Disney's core values are worth more than a billion dollars to the company. That is more than most people would make in ten (10) lifetimes.  After learning this, I can say it must have been a very expensive and difficult decision to make.

Tools:


  • Tableau Public
  • Microsoft Excel

Data:


The data set was found on Kaggle.  The data set covers all the Marvel franchise (MCU) movies from 2008 to the end of 2018.  The data set comes with information about box office revenues, important people involved, movie characters, financials, and dates.
Here is the data set from Kaggle (MCU Movies)

Data Cleaning:


The cleaning of this data set was completed in Microsoft Excel.  Since the data was fairly simple, it was easier to do a quick clean in a spreadsheet, and it took about 20 minutes to complete.  A few attributes also needed to be cleaned up for the import into Tableau. The financial information needed to be converted to an integer to give it a numeric value.  This was completed by changing the text and number into the actual dollar amount. I also formatted the directors' names into a single cell.  The data also includes the names of other people involved in the film, which will be cleaned up in a later post. The last part of the data clean-up was removing 'minutes' from the 'run time' attribute.  This helped change the field to an integer (not a string) format.
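
For anyone who would rather script this step, here is a rough pandas sketch of the same cleanup (the file name, column names, and exact text format are assumptions, not the data set's actual headers):

import pandas as pd

# load the Kaggle MCU data set (file name is an assumption)
mcu = pd.read_csv("mcu_movies.csv")

# turn text like "$200 million" into a whole-dollar integer
def to_dollars(value):
    amount = float(value.replace("$", "").replace(",", "").replace(" million", ""))
    return int(amount * 1_000_000)

mcu["Budget"] = mcu["Budget"].apply(to_dollars)

# strip " minutes" from the run time so the field becomes an integer, not a string
mcu["Run Time"] = mcu["Run Time"].str.replace(" minutes", "").astype(int)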

Wednesday, January 23, 2019

10 Years of Marvel Comic Movie Magic - Part 1 (Which Movie Possesses the Infinity Stones?)

Insight:


Every time you turn around, it feels like there's a new superhero movie coming to theaters.  The Marvel Cinematic Universe (MCU) is literally taking over the movie industry, one film at a time.  How did this happen?

The idea to do a Marvel Cinematic Universe themed project came to me with the big buzz surrounding the approaching Oscars.  But this data viz is going to be a little different than my previous projects.  I'm going to take the next few posts to REALLY dig down deep into the superhero world of movies.  After getting some amazing (and enlightening) feedback from some pros in the data science world, I thought this project should focus more on analysis and less on the data visual design.  Hope y'all have got your armor on because I am about to blast open some superhero data.

Simply put, I wanted to see why the Marvel Cinematic Universe is dominating the box office- how much money has the franchise made as a whole, and which movies are (based on just this data set) the breadwinners of the Marvel world.  For those who have not yet succumbed to the dark side, you're about to find out what we comic nerds are so excited about!


Data Viz:


Here is the link to the Tableau Public Workbook on my portfolio.


Insight:


Because every superhero needs a back story, let me fill you in on Marvel's money-making movie history: 

The MCU franchise as a whole has done extremely well in Hollywood over the last ten (10) years.  Marvel movies have grossed over $17.5 billion in box office sales (about $1.75 billion a year in revenue).  Because the franchise has only spent about $4.4 billion (roughly $367 million a year) on making these movies, they have pulled in around $790 million in profit EACH YEAR!  By the end of year ten (10), the MCU franchise has made a total estimated profit of $13.1 billion.  That is an astronomical amount of money.  Tony Stark is embarrassingly wealthier than Bruce Wayne.



With the Oscar buzz coming and people raving over Marvel's incredible Black Panther, the whole world is waiting with bated breath to see if Black Panther sweeps the awards this year.  (And if you're not waiting in excruciating anticipation, you need to go back to Marvel film #1 and watch them all in proper order, and become obsessed with the franchise like the rest of us not-normal people.) 

Black Panther has already outperformed all the other Marvel movies, with the exception of the three (3) films in the Avengers series.  The MCU franchise has four (4) of the top ten (10) highest grossing films ever made, and Black Panther is currently holding its own at spot #9.  The Avengers films are sitting at spots 4, 6, and 8- because Avatar and Titanic are still defending the top two spots like movie champions. (I'd like to take this moment to nominate James Cameron for GOTG3.) In Hollywood 'big-wig' speak, movies are valued based on their return per dollar spent and their box office revenues.  The Black Panther film ALONE earned $5.41 for every dollar spent and took home over $1.3 billion from the box office.  Whoa.  That's a lot of money!



The MCU franchise has five (5) additional movies that also reached the billion-dollar mark, but only two (2) of those six films were first installments: Black Panther and The Avengers.  (For those of you who may be more low-key moviegoers, 'first installment' means the first movie in a series.)


Now that we've learned a little bit about how bad-a$$ the MCU is in terms of box office revenue, let's take a look at the other side...because you can't have winners without losers, and every superhero fan loves an underdog... 

The MCU film "Captain America: The First Avenger" performed horribly (not as bad as The Incredible Hulk by Universal, but close) compared to other MCU movies at the box office.  While I don't think any of our wives were complaining about watching Chris Evans (and his shirtless scene) on the big screen for two (2) hours, we superhero fans were relatively skeptical.  The return on every dollar spent was only $0.71, and the box office only pulled in $370.6 million.  The film was one of the franchise's largest disappointments compared to the other MCU movies. 


Don't worry though, Captain America did much better for the second and third installments, and little boys were quickly trading in their Iron Man gear for a star-spangled shield. 

One of Marvel's superpowers is the ability to create sequels and follow-up movies that outperform the initial film.  Second and third installments tend to do better in both the return per dollar spent and box office revenues.  For example, Captain America: Civil War did three (3) times better than Captain America: The First Avenger.  The sequels for both Ant-Man and Guardians of the Galaxy made about $100 million more than their first movies.  Thor: Ragnarok (the third Thor film) and Iron Man 3 both doubled (2x) box office sales compared to their first respective movies.  And, while the comic world was largely unimpressed with Avengers: Age of Ultron (the second Avengers film), Marvel absolutely KILLED IT with Infinity War (the third Avengers film).  In fact, Infinity War made $500 million more than The Avengers and pulled in over $2 billion through box office sales.

Based on these trends, Endgame should be the best performing movie of the entire Marvel franchise.  Endgame is the 4th movie in the Avengers series, and if we use data as our guide, Endgame will easily break the $2 billion mark in box office sales.  (Not to mention they left the previous Avengers movie on a fantastic cliffhanger, so how could you NOT go see it?!)  Avengers: Endgame is scheduled to be released after Captain Marvel.  Unfortunately, the Captain Marvel film is set roughly 20 years in the past.  Hopefully, this will not be a profit killer- Captain America: The First Avenger also took place back in time, and comic nerds tend to get far more excited about the future than the past.


Project:


This project is part of a series of analyses that examines different aspects of the Marvel Cinematic Universe.  This first part reviews the movies themselves and their performance in box office sales and rate of return, both as a whole franchise and as individual films compared to one another.  Through this analysis, we can discover outliers and identify which films should have disappeared after the famous gauntlet "snap" versus the films that hold the cinematic power of the Infinity Stones!

Tools:


  • Tableau Public
  • Microsoft Excel

Data:


This data set was found on Kaggle and was created by Rohit Neppalli.  The data set covers all the movies from 2008 to the end of 2018.  The data set comes with information about the box office revenues, important people involved, movie characters, financial information, and dates.
Here is the data set from Kaggle (MCU Movies)

Data Cleaning:


The cleaning of this data set was completed in Microsoft Excel.  Since the data was fairly simple, it was easier to do a quick clean in a spreadsheet, and it took about 20 minutes to complete.  A few attributes also needed to be cleaned up for the import into Tableau.

The financial information needed to be converted to an integer to give it a numeric value.  This was completed by changing the text and number into the actual dollar amount.

I also formatted the directors' names into a single cell.  The data also includes other names of people involved in the film, which will be cleaned up during a later post.

The last part of the data clean-up was removing 'minutes' from the 'run time' attribute.  This helped change the field to an integer (not a string) format.

Tuesday, January 15, 2019

100 Years of Plane Crashes

Idea:

I discovered a great data set on Kaggle today that really hit close to home for me.  On the off chance that you haven't leisurely perused my bio section just yet, let me save you from a few moments of time spent: I have a background as a Flight Engineer on the Orion P-3 aircraft, gained while serving in the US Navy for 8 years.

So imagine my surprise when I found data relating to aircraft- although it is a somber context, it was also really awesome to go back to my flying days for a moment.  During our training, we would learn about various types of aircraft crashes and what went wrong.  Then we would discuss what we would have done differently and the lessons learned from the crash.  This training was quite extensive and very important. You remember that pilot who landed the airplane in the Hudson River like a superhero? Hello, previous military training. (And maybe his 30+ years of aviation experience after the military helped, but it was mostly the military training.) Nevertheless, it is no overstatement to say that knowledge of previous aircraft crashes and their potential cause/effect/correction is a HUGE necessity in the world of aviation.

This data set was great to stumble upon given my previous career in flight, and I was eager to gain more insight into the history of aircraft crashes around the world.  My idea was to take the existing data set and build something that could give viewers insight about various aircraft crashes. I wanted this project to be something that could be used by any person or organization to view and potentially learn from others' mistakes.

This data viz is dedicated to those who have lost their lives in flight and their families.

Data Viz:

Insight:

This data visualization can be used to gain a lot of insight and information about aircraft crashes over the last 100 years. Let's explore the data below.
When looking at how many lives were lost compared to those who survived, we can look at the data chart towards the bottom (Casualties by Year) and see that there are peaks during two time periods showing a significantly higher number of survivors than on the average airplane crash. Why were there more survivors during these two periods? Unfortunately, the data doesn't give us any insight into the WHY of things. That is a problem best left for Google. Perhaps we can spend some spare time looking up the individual crashes and seeing what set them apart? Keep in mind, however, that this data spans crashes over the entire world, so we are talking about multiple flights for each recorded year. I think it would be a fair assessment to conclude that in 1998-2000, the pilots manning the controls were well-trained and amazingly prepared for disaster.  This time period had a HUGE number of survivors when compared to other time periods.  This data could be used within the aviation field to examine training methods, average pilot experience, particular aircraft safety, and more, in an attempt to understand and further strengthen the abilities of the pilots who managed to secure the safety of such a large number of passengers.

This chart also displays two more interesting tidbits of information.  First, we can see that the overall number of crashes starts to go down dramatically around the year 2000.  This is amazing- we are making progress in the long journey of flight travel, making it faster, more efficient, and safer than ever. The second thing we can determine from this chart is that more people are starting to survive plane crashes in general- especially when compared to the 1940s through the 1980s. In the 40s, we had crashes we can attribute to World War II. In 1958, Pan American (Pan Am) launched the Boeing 707 flight from New York to London, creating the availability of commercial trans-Atlantic flights. If Frank Abagnale, Jr. was able to walk onto a plane and fly it, no questions asked, then Pan Am's credentials really weren't up to snuff and crashes were bound to happen. (If you've never heard of Frank Abagnale, Jr., you really need to brush up on your Leo DiCaprio movies; it's one of his best.)

All kidding aside, it took a couple of decades for aviation experts to devise aircraft and training procedures that started to result in crashes being minimized and lives being saved. This is great news! We first see a rise in crash survivors, then a trending drop in crashes themselves. Amazing!
The clickable circle on the left side shows all of the airplane crashes side by side. Each crash is displayed as a dot, colored from light to dark based on the number of casualties per flight, whereas the size of the dot symbolizes the total number of people on board the flight. The first thing we see when looking at this circle is that there are two large, deep-red dots that immediately stand out.  These dots signify a large number of people on board the aircraft as well as a high number of casualties when the airplane crashed.  In the circle, you can move your mouse over each dot to display a summary of the crash that explains what happened.  This is a great way to display a large amount of information in a limited space, while still maintaining visual appeal.
The chart on the bottom right shows the total number of lives lost due to aircraft crashes.  As the years progress, we see the running count of casualties stay very low until 1940, followed by a rapidly rising slope from 1940 onward. If you look closely, however, you can see that the rise of the slope starts to taper off at the top. This reiterates the fact that aircraft casualties are starting to decline significantly compared to previous years. Travel by flight is slowly becoming safer and safer as time goes on.
The chart on the top right is especially useful for frequent travelers. This chart breaks down the total casualties by organization. A viewer can use this chart to determine which airline to avoid for their future trips.  One thing to consider when exploring this data is the amount of time each company has been in the flight business. For example, some of the airlines on this chart are fairly new but already have a high count of lives lost due to aircraft crashes. (Hint: these companies are at the bottom.)  The other interesting name on this chart is Pan American.  They only lasted about 64 years in the commercial flight industry and claimed over 1,000 lives in the process.  (It makes you wonder about the real reason they are out of business...and no, Mr. Abagnale never crashed his flights.)
The last thing I want to mention about this data visualization is the ability to filter the data by Aircraft and/or Organization. You can use the drop-down menu at the top right corner of the chart to select these filters, and the chart will adjust accordingly. The image above shows the data after I applied a filter for the aircraft I was a part of during my time in the US Navy.  (Pop quiz: What aircraft did I fly in the US Navy? If you scroll up, that's cheating.) By selecting the P-3 (or Lockheed Orion, its non-military name) as the aircraft type, the results appear as shown above. You might notice two (2) large crashes which claimed a large number of lives.  The two peaks in 1968 and 1973 show those two large crashes on the bottom left chart.  If you ask a Navy FE what happened, they will not only tell you the story in detail over a few beers, but they will also tell you what they would have done to save the day. Flight engineers get a lot of the stress, very little of the decision making, and none of the credit. (But I'm biased, of course.)

Project:

This project idea came to me after coming across a great data set found on Kaggle.com from Sauro Grandi. (Thank you, Sauro, you are amazing.)  I saw a great chance to analyze various aircraft crashes and discover if there were any patterns or insight related to the casualties caused by aircraft crashes.

Tools:

Data:

The data was originally found on Kaggle.com.  The data used is version 4, downloaded as a CSV.

Data Cleaning:

This data came with a large amount of information.  The only data needed was related to the casualties of each flight.  For example, this particular project required the number of casualties per crash, the number of survivors on board, the date of each crash, and the airline and aircraft involved. The existing data set also offered the location of each crash, which would need to be cleaned for geolocation but was not necessary for this project.  It could easily be cleaned at a later date to analyze crash locations if desired.

Process:

The first thing I needed to do was import the data into Tableau Public.  This required the CSV file to be opened in Microsoft Excel and then saved as an Excel workbook, because the public version of Tableau does not import CSV-type files. (This is a feature of the Tableau paid version, however.)

Once the file was imported, I wanted to see the casualties over time using a line chart display.  I added the total number of passengers aboard on the same axis of the chart, but in a different color.  The "aboard" total was placed behind the casualties graph to show a trend in years where there were more survivors (if any).  This was displayed with the yellow color (total passengers aboard) above the red (fatalities) on the graph.

The second chart I created was the bubble chart (the large circle to the left), to show each plane crash in comparison to the others.  Because each bubble represents an aircraft crash, the circle was a perfect visual representation of this data.  I assigned a color to each flight based on the severity of casualties, and the size based on the total number of passengers.  Large circles that are light yellow signify more survivors, whereas large circles in deep red signify more lives lost in the crash.

To create the chart that displays the running total of lives lost by year (at the bottom right), I used an area line chart to give a visual representation of the running total of casualties over time.  The starting year for the chart is 1909 and the ending year is 2009.  By using a time span of 100 years, we are able to get a clean and compelling visual.

For the next chart, it seemed important to know which organization has had the most fatalities throughout the 100-year time frame.  I created a top ten chart based on the total number of lives lost in the crashes and color-coordinated the chart to match the rest of the display.

The last portion of this project was adding the filtering options at the top right of the display.  The filters used include the type of aircraft, year, and organization.  These filters help the user pinpoint any particularly relevant information they may want to see in the chart.

Please take a moment to remember those who have lost their lives to these tragic events in history.  They should be remembered and not forgotten.

Sunday, January 13, 2019

Percent of US Population Getting Minimum Wage or Lower.

Idea:

My main goal for this project was to see if I could illustrate the common problem of data misinterpretation. The idea for this data visual comes FRESH to you from the most recent Makeover Monday project- Week Three (3) of 2019.  The data visual in its original form is displayed with a blue monochrome color scheme, color-coded in different shades of blue based on the percentage of people being paid at or below minimum wage by state.  You can see the original data visualization at the very bottom of this post.

Because nothing says 'trolling' like politics, I thought it would be fun to break this data up by time (year) and presidential term.  Before anyone starts panicking, I did not include the current presidency in this data visualization- we can save that bickering for Facebook.

The data included information from 2002 up to 2017.  With this data, I was able to compare the Bush Administration to the Obama Administration in terms of wage levels earned by the American people.  The goal for this project is to display the information by state, year, and political party.  Also, just for fun, I have included a picture of each president before and after their 8-year presidency terms.

Data Viz:

Here is a link to the Tableau Portfolio 

Insight:

One of the most difficult things when analyzing research, including statistics and data evaluation, is to examine and present the data without any personal biases. Here's the catch- data can be misread and used wrongly without too much effort and sometimes even by accident.  We see it all the time in the news- most recently with an interesting debate about whether or not coffee is good or bad for you, for example. (Regardless, I still enjoy about 7 cups of coffee per day...)

Using this chart, let me show you a great example of how data can be misinterpreted.  Let's hop into the DeLorean (did I just age myself?) and travel back to 2008, say around the end of the year-ish.  Do you remember what happened in late 2008?  BOOM!  The housing market collapsed.  Bush was working hard to stimulate the economy (and soften the blow) with subsidies, payouts, and stimulus checks, among other efforts. It was a rough year. Yeah, now you remember.  On top of all of that, Obama was about to take office in a few short months.  What better way to come into office than cleaning up someone else's party mess (pun intended)? Cue the mom from Mrs. Doubtfire, this is her jam.


The data here shows that the share of workers earning at or below minimum wage is at its highest during Obama's first two (2) years of presidency, and then it begins to trend back down. What we can conclude from this data is that the events of 2007 and 2008 put a massive strain on the economy, which we started to see in 2008 but really felt the blow of in 2009 and 2010. This shows what a huge impact the housing market has on the overall economy.  Not only were stocks plummeting, but employees' paychecks took a hit- a hard hit.  This is a classic form of economic cause-and-effect. While the Obama Administration was able to reverse the effects over the course of his presidency, it was a continual work in progress that took several years.

That being said, it is easy to look at this data without thinking too much about the circumstances behind it and see only that the first two years of the Obama Administration hold the highest level of low wages on the entire chart. Jumping to initial conclusions from the chart alone, it looks like the Obama Administration was a wage-wrecker.

The biggest insight for this data viz, then, is to remember what is happening in your data- what the data MEANS.  While a data chart might show one thing right away, as viewers we need to understand the reasons behind each reading. This is why you will almost never see a data visualization without some sort of background information accompanying it- not in good data science, and not in good journalism. Even with background information, it can still be easy to misinterpret data. It is our responsibility as data-visualists to try and eliminate that as much as possible and remain unbiased at all times.

Project:

This was a data set for Makeover Monday on Data.world.  The project was to make over a data viz from Business Insider; the task was to improve it. The data set is about the percentage of people in the US who get paid the federal minimum wage or lower.

Tools:

Data:

The data set comes from Data.world.  However, the original data comes from the Bureau of Labor Statistics.

Data Cleaning:

The data was clean and ready for importing into Tableau.  The only change needed was changing the numbers from decimals into percentages.

Process:

The first thing I needed to do was make the small maps of the USA.  I combined the data from all 8 years of each presidency and filtered it to show the average percentage of the population, by state, that made at or below minimum wage. For example, the average for each state during the Bush Administration is displayed in the red map, and the average for each state during the Obama Administration is shown in the blue map.  I went with three shades of each color to uniformly represent the severity in each state, ranked light to dark as GOOD, AVERAGE, and BAD, with the darkest color being the lowest (worst) wages. To keep things within the desired political theme, I used red and blue for the political party affiliation of each president. This would not have worked if both presidents had been from the same political party, so I lucked out with that one.
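
If you wanted to reproduce that per-term aggregation outside of Tableau, a small pandas sketch could look something like this (the file and column names are assumptions about the Data.world download, not its exact schema):

import pandas as pd

wages = pd.read_csv("minimum_wage_or_lower.csv")

# keep only the years covered by the two administrations compared here
wages = wages[wages["Year"].between(2002, 2016)]
wages["Administration"] = wages["Year"].apply(lambda y: "Bush" if y <= 2008 else "Obama")

# average percent of workers at or below minimum wage, by state, for each term
term_avg = wages.groupby(["Administration", "State"])["Percent"].mean().unstack(0)
print(term_avg.head())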

My second goal was to display the change in wages over time in yearly increments.  I decided to use a bar chart and balanced it out visually with vertical rows of stars along each side border, for a stars and stripes theme.  I kept the red and blue color scheme, corresponding to the years of each president.  This split the dashboard up into two (2) sides extremely well for an easy but aesthetically pleasing effect.

The last thing I needed to really set the visual over the top was a picture of each President.  I chose a picture of each president at the beginning of their presidency, and another picture of them at the end of their presidency. I used this to give a fun (but somewhat shocking) visual display of the rapid aging process that occurs when taking on a role as significant as the President of the United States. That is one job that will age you faster than any other.

Overall, it hopefully comes together in a fun and informational visual display, and the original data set has new significance when broken down into the terms of presidency, adding a little extra perspective to the original visualization.

Saturday, January 12, 2019

The Many Flavors of OREO

Idea:

Who doesn't love Oreos? (I mean, apparently some people...but Oreos are definitely one of America's most beloved snacks.) These self-proclaimed 'Wonderfilled' cookies have an obsession with creating a mind-blowing number of wild new flavors...I'm pretty sure I see new flavors every time my wife sends me to the store. I'm also pretty sure that my wife sends me to the store so I can stock up on junk food and she doesn't have to feel guilty about it. But I'm cool with that.

When I stumbled upon an Oreo taste-testing data set, my brain went instantly into midnight snack mode.  This particular data set is from the famous Kaggle data scientist mastermind, Dr. Rachael Tatman.  Rachael created a very simple survey of twelve (12) Oreo flavors, using 5 taste testers.  The data set is perfect for illustrating how to read data and explaining what one would look for when reviewing survey results.  Plus, it talks about 'America's Favorite Cookie,' the OREO! With a fresh glass of milk in hand, I embarked upon the goal of turning Rachael's awesome Oreo data set into a visual that would be almost as appealing as the cookies themselves.

Data Viz:

Here is a link to the Tableau Portfolio Page

Insight:

This was a great data set to explain what one would look for to make a Tasteful decision.  No more standing in the cookie aisle, unable to decide whether those new Mint Oreos or Red Velvet Oreos would be better. The first image highlights the overall distribution of the data.  As you can see, the data has a left-skewed (left-tailed) distribution.  This basically means the tail of the data stretches toward the low scores while the bulk of the responses sit near the top, with the higher scores closer to 5.  With only 5 taste testers, these cookies were approved by everyone who tried them. (In grocery shopping terms, these are the flavors the whole family will enjoy.)


Next, we have the bar chart of the average score for each flavor.  Since this data was on a small scale from 1 to 5, the average gives an overall picture of the score for each flavor.

This type of chart works great for visually displaying product review scores. The three white bars at the left are the lower scoring flavors.  The light blue horizontal bar is the average score of all the flavors.  The three flavors that score below the average could possibly lead to some interesting information in terms of product approval. Surprisingly, these testers did not like my personal fave, the MEGA STUFFED Oreos. It's sad, I know, but data doesn't lie. (Don't mind me while I drown my sorrows with another row of cookies.)

One thing to look for when creating data visuals is whether or not you have all the necessary data.  In this case, we can see that data is missing because one of the taste testers only gave responses for 2 of the 12 flavors.  This could lead to skewed results. This is why it is important to pick a MEGA STUFFED sample of taste testers- most say 30 or more people- to get a more reliable score.  If one taste tester is unable to complete the entire survey, it would not dramatically alter the results in a pool of 30 people the way it does with only 5 testers.  (Whew, hang in there MEGA STUFFED, we still love you!)


The last chart helps us understand the cookie eaters in a visual context.  This chart shows the average score given by each person and compares it with the rest of the taste testers.  The dashed line marks the overall average score across all the cookie dunkers.  As you can see, the two tasters on the left seem to give lower scores in general.  They have been labeled HATERS of Oreos when compared to the rest.  (This type of analysis can help you understand how the respondents feel about the review overall, which is a handy tool for data evaluations.)  Since 2 out of 5 give lower scores, we could say that we have more Oreo LOVERS in the overall sample. I hate to keep bringing this up, but these two low-score givers must have also given low scores to the MEGA STUFFED OREOS.  I chalk that up to not having milk. There's simply no other explanation.


Overall this type of analysis is great for finding out what the responses are to a particular survey.  This is an excellent way to view the responses in a grand-scale context and make a better-informed decision for future products.  In this case, Oreo could use this data to see which flavors were a slam-dunk, and which ones crumbled under the pressure. (See what I did there? I'm so punny.)

In this particular data, we could also investigate the three lower scoring cookie flavors and ask why these scored lower.  Maybe, for integrity's sake, we should compile a group of willing taste testers to give these flavors a second chance. But please, make sure you have a jug of fresh milk with you, because this could take you on a long trip down the rabbit hole.

Project:

This project was created to help explain how to visually display a survey-based data set.  Surveys are a widely used way of gathering data, and they provide the insight needed to make informed decisions about what is working and what is not.  This project explains how to analyze this type of data to provide extremely useful insight.

Tools:


  • Tableau Public

Data:

This data set was discovered on Kaggle.com and created by the famous Dr. Tatman (Kaggle data scientist extraordinaire).  It is survey-style data of the kind very commonly used to gather intelligence.  This set caught my attention simply because it was about OREOS- and Oreos are simply awesome and delicious. Therefore, this had to be done.


Data Cleaning:

The data set had missing data, which was displayed as NULL in various fields and needed to be cleaned up. This was easily fixed in Tableau by filtering the data.

Process:

This data viz was super simple and perfect for those looking for quick insight into a particular survey.  The data was imported into Tableau.  I first selected the responses, then right-clicked the selection and chose the "Pivot" option.  This stacks the data perfectly for creating a visualization in Tableau.

With the first chart, we needed to figure out the distribution shape of the data. I did this by selecting the "Pivot Field Values" for the values, and then selecting the histogram graph option.  Doing this, the 'Milk Dunk' histogram is automatically made.  This one has a left-skewed (left-tailed) shape.

The second chart is a bar chart which displays the average scores for each flavor.  The average was used because this type of data was collected in the survey using a scale from one (1) to five (5).  The "Pivot Field Names" and "Pivot Field Values" were placed on the workspace to represent the flavor and score.  More specifically, the "Pivot Field Values" were averaged to get an overall picture of the scores from all five (5) samples.  I added a horizontal line (using the data from the Analytics tab) to show a visual display of the overall average (or standard) score. This helps the viewer easily see which scores are lower and which are higher in comparison to the average score.

The last chart was implemented to help understand what type of cookie eaters tried the Oreos.  Simply put, are these taste testers positive or negative in their overall responses? (Are they Oreo lovers, or Oreo haters?) Honestly, I must say that these taste testers do not seem to be very positive, as evidenced by the horribly disgraceful rating they gave to the MEGA STUFFED OREO!  (Hang on, I need a minute, it still hurts.) Okay...back to work. I wanted to display this last chart as a plot chart, but it could have also been created as a bar chart.  For visual purposes, I felt that a plot chart provided better balance for the dashboard's overall appeal.  The same concept was applied to this plot chart as to the previous bar chart (for the flavor scores above).  The plot chart helps us identify what type of people are taking the survey by viewing each taste tester's cookie ratings as an averaged score.  I divided this score into two categories based on whether each person's average was higher or lower than the overall average, giving us two people who "HATE" Oreos, and three who "LOVE" Oreos.
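
For readers who prefer code to clicks, here is a minimal pandas sketch of the same summaries (the file and column names are assumptions about Rachael's survey file, not its exact layout):

import pandas as pd

scores = pd.read_csv("oreo_taste_test.csv")  # one row per taster, one column per flavor

# melt the flavor columns into one long column of scores (like Tableau's "Pivot")
long = scores.melt(id_vars="Taster", var_name="Flavor", value_name="Score").dropna()

overall_avg = long["Score"].mean()

# average score per flavor, and the flavors that fall below the overall average
flavor_avg = long.groupby("Flavor")["Score"].mean()
print(flavor_avg[flavor_avg < overall_avg])

# average score per taster, split into Oreo LOVERS and HATERS
taster_avg = long.groupby("Taster")["Score"].mean()
print(taster_avg.apply(lambda s: "LOVER" if s >= overall_avg else "HATER"))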

You can use these same principles with any survey-based data set to get a great visual of the information provided.

Friday, January 11, 2019

Etsy Shop Keyword Tag Scraping With Python

Idea:

This project stemmed from the idea of helping my wife with her Etsy shop. She's been running her shop on Etsy for a couple of years now, and to make her life a little bit easier I wanted to use my data skills to help her do some product research. Happy wife, happy life.

My main goal was to try to generate a CSV file with all the keyword tags used in each of the top fifty (50) products from a specific search query on Etsy.  I would then be able to use this data to discover what keywords other sellers are using, in hopes of improving Etsy product listings by adding the most popularly used keywords to existing listings (for example, on my wife's products). Theoretically, this would improve the odds of these listings appearing in relevant Etsy searches.

**Sidenote: There are now two (2) organizations that have built great online cloud platforms around this same idea.  Both websites provide this data as well as more insight than what is available with just a simple scrape.  Here are both links: marmalead.com and etsyrank.com**

Data Viz:



Insight:

The best insight for this project was finding keywords to use in a listing.  When reviewing the word cloud, I want to pay close attention to the little words just as much as the larger words.  The little words might provide great long-tail keyword tags that I may be missing.

The other thing we are looking for, which might impact our listing, is the different spelling variations or phrases used in the product listings.  Finding these kinds of data nuggets can help strengthen Etsy SEO, since some search queries use spellings that differ from the correct one.


I would also rank the words by the number of times they show up in the column.  This gives an exact count of how many listings each word appeared in, which highlights the must-have keyword tags as well as the words that are possibly being underutilized. (A quick pandas version of this count follows the table below.)


Keyword_Tag               COUNTA of Keyword_Tag
Clothing                  98
Tops & Tees               94
Unisex Adult Clothing     84
T-shirts                  84
dance shirt               70
gift for dancer           50
dancer shirt              50
dancing shirt             46
dance                     42
dancer gift               38
*These are the top ten keyword tags used in the example
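
Here is a quick pandas version of that count (a sketch; it reads the "dance+shirt.csv" file that the scraper at the end of this post produces):

import pandas as pd

tags = pd.read_csv("dance+shirt.csv")
# count how many listings each keyword tag appears in, and show the top ten
print(tags["Keyword_Tag"].value_counts().head(10))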

This same style of analysis could also be used on the titles of the products.  The product titles can be thoroughly analyzed to find the best words and how those words are positioned within the listings.  This requires some understanding of statistics and of how to measure where a word falls within the title, but it can be done.

You can also perform a cost analysis of the products being sold.  This would help you decide what price point you might want to set for your product.

And lastly, you may be wondering why we needed the seller name.  Well, let me tell you, it provides great insight.  This information shows you how many times a seller shows up in a particular keyword search.  A seller who appears many times is a bigger competitor in terms of ranking, and it would be worth investigating that seller further to help improve your own product listing on Etsy.

Project:

Etsy is an online marketplace where a seller who makes things (such as crafts) can sell their goods to customers.  Etsy is similar to Amazon and eBay but targets the sale of handmade goods and craft supplies.  Their platform allows for thirteen (13) keywords for each product listing, which they call keyword tags.  These keyword tags are what help get the product noticed in the search algorithm.  Each of these keyword tags is displayed on the product page near the bottom.

This project is going to scrape those keyword tags.

Tools:

  • Anaconda
  • Python 3.7
  • Jupyter Notebook
  • Pandas
  • Beautiful Soup
  • Google Sheets

Data:

The data is from Etsy.com in an unstructured format.  This scrape will be used to build the data set.

Here is the example dataset generated in Google Sheets.

Data Cleaning:

The data cleaning was completed in Google Sheets to help build a quick word cloud.

The data we will be gathering in this blog post comes from the Etsy search term "Dance Shirts."  Here is what the CSV file looked like when I ran the code.

Process:

The code is written in Python in a Jupyter Notebook.

In order to scrape the site, I need the Beautiful Soup and pandas packages installed in Python.  I am using Anaconda, which already includes these packages.

Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Next is to fetch the data from Etsy.com.  I need the URL of the page I am going to scrape and a look at the site's HTML.  The HTML is broken into tags, and those tags are what separate the elements I need to extract.  What we are looking for is the results page from a search query on Etsy; by learning that page's structure, we can give our crawler a list of products.

Code:
# this is the keyword search term for Etsy
query = "dance+shirt"
# request the search results page and parse its HTML
etsy_page_search = requests.get("https://www.etsy.com/search/?q=" + query)
soup_search = BeautifulSoup(etsy_page_search.content, "html5lib")

Each listing ID found on the search page needs to be turned into a product URL and collected into a list, which the next part of the crawl will use.

Code:
#This is the listing id list
listing_id = soup_search.find_all("a")
#This holds the listing url
list_id_records = []
keywords_records = []

#this gather listing url by listing id and adding to website address
for listing in listing_id:
    list_id = (listing.get("data-listing-id"))
    if list_id != None:
        url_product = "http://www.etsy.com/listing/" + str(list_id) +"/"
        list_id_records.append(url_product)

After building the code to get the 50 products on the page, we need to open each product page and have the crawler scrape it.  The main data we are looking to grab from each page is as follows:
  • Title of the product
  • Name of the Seller
  • The sale price of the product
  • The keyword tags
Code:

# get the product page information for each listing URL
for list_id in list_id_records:
    etsy_page_product = requests.get(list_id)
    soup_product = BeautifulSoup(etsy_page_product.content, "html.parser")
    # every keyword tag link on the product page
    keywords_list = soup_product.find_all("a", {"class": "text-center btn btn-link tag-button-link"})
    for keywords in keywords_list:
        keyword = keywords.text
        # title, seller, and price are the same for every tag on the page
        title = soup_product.find("h1", {"class": "mb-xs-2 override-listing-title break-word"}).text
        seller = soup_product.find("span", {"itemprop": "title"}).text
        price = soup_product.find("span", {"class": "currency-value"}).text
        keywords_records.append((title, seller, price, keyword))

The natural way this data would come out is with a listing's thirteen (13) keywords side by side in one row, making thirteen columns of keywords.  What we want instead is to have all of the keywords in one column.  By placing them in one column, we also have to repeat a lot of the listing's other information.  Guess how many times? You guessed it- thirteen (13) times.

The reason for this madness is to make it easier to build the word cloud picture in Google Sheets and to get a simple count of the words.

After we gather all this information, we need to store it in a CSV file.  I used pandas to build the DataFrame and write it out as a CSV file.  The CSV file generated by the code will then be used in Google Sheets.

# build a DataFrame from the scraped records and save it as a CSV named after the query
df = pd.DataFrame(keywords_records, columns=["Title", "Seller", "Current_Price", "Keyword_Tag"])
df.to_csv(query + ".csv", index=True, encoding="utf-8")
# quick sanity check on how many records were scraped
len(keywords_records)

Inside Google Sheets, we open our CSV as a Google Sheets file.  This gives us a spreadsheet with all the information we scraped.  We then take the keyword column and use it to build the word cloud: the larger the word, the more commonly it is used.
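
If you would rather stay in Python instead of Google Sheets, the third-party wordcloud package can build the same picture from the scraped CSV (a sketch of an alternative, not the approach used in this post):

import pandas as pd
from wordcloud import WordCloud

tags = pd.read_csv("dance+shirt.csv")
# weight each keyword tag by how many listings it appeared in
freqs = tags["Keyword_Tag"].value_counts().to_dict()

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(freqs)
cloud.to_file("dance_shirt_word_cloud.png")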
