Thursday, December 27, 2018

Women and Their Crossword Puzzles

Insight:

What could be more interesting than knowing not only how many women created crossword puzzles compared to men, but also which day of the week each gender is most likely to publish them? Should I point out that this is sarcasm, or did you figure that out already? The one thing really lacking from this project's data set was the age of crossword puzzle contributors. Seriously, I am DYING to know how many people spend their 20s (or 30s) turning crossword puzzle writing into a lucrative career. I'm making a mental note to stock up on #2 pencils right now...and wine, while I'm at it. (Wine improves anything, including boring data sets.)

Anyhow, my number one goal when taking on this edge-of-your-seat project was to take an existing data set and create a radar chart to display that data in an aesthetically pleasing format. 'Cuz nothing bulks up a portfolio better than a wide variety of awesome data viz displays. So let's take a look at which gender writes more crossword puzzles, and when.


2001

When creating a visual display for this data, the biggest trend I noticed was that the polygon shape sat mainly on the right side of the graph across all of the different years and decades.  This means that women's crossword puzzle contributions were concentrated between Sunday and Wednesday. Does this mean that women are the most bored during the first half of the week? Or is that when women are most productive? Considering that they have to put up with us men throughout most of the weekend, it's a wonder they don't use that time to practice digging holes in the backyard. (P.S. If you don't get that joke, you probably aren't married.)

It should also be noted, however, that men still dominated the overall percentage of crossword puzzle contributions.  Looking at the gender of contributors, men are far more likely to create crossword puzzles. When you think about how many movies and TV programs show old men reading the paper on weekend mornings, it makes sense that men would be the ones writing them, too.

The years 2001 (see chart above) and 2008, however, had at least 50% contributions by females on "Monday."  (So this would support the theory that women are particularly productive on Mondays.) It was surprising to see when most women made crossword puzzles, and how often they did so.

2008

Idea:

The idea for this project was to create a radar chart using existing data.  This dataset is perfect to try with a Tableau Public account.  Tableau does not offer a radar chart in the "Show me" tab, so it has to be built manually with calculated fields.  The idea was to use a black and grey monochrome color scheme to show how the days on which women contributed to the crossword changed over time.

Data Viz:

The link to the workbook for download is on my Tableau Public portfolio.

Project:

This was a data set for Makeover Monday on Data.world.  The project was to make over, and improve, a data viz from XWord Info.  The goal was to gain insight into how many women constructed crossword puzzles in the Shortz Era and on which day of the week their work was published.

Tools:

  • Tableau Public
  • Adobe Photoshop

Data:

The dataset comes with the year and the total number of contributors of each gender who worked on the published crossword puzzles.  It was broken into male and female.  The last bit of information in the dataset was the day of the week.  The years range from 1993 to 2017. Again, I feel like I should note that it would have been awesome if the data set included the age of the contributors. (How many years' worth of pencils should I order, exactly?)

The data forms a time series, which is perfect to display.  It also shows the difference in the number of men and women who contributed to the crossword puzzles.

Data Cleaning:

The data cleaning process was extremely easy after syncing the data set from Data.World to the Tableau Public interface.  Fetching the data worked great, with no problems.

Process:

Since the data is in decimals and not a percentage, a new "Calculated Field..." needs to be created to get both the percentage of women and the percentage of men.

% of Men = Men ÷ (Men + Women)

% of Women = Women ÷ (Men + Women)

Converting to a percentage of the whole provides a standard value across all the data, so the radar charts line up.  Matching up the values made it easier to interchange the data without a problem.  Since this is a gender comparison, a percentage-type value is perfect.
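
Just to make the arithmetic concrete, here is the same percent-of-total calculation sketched in R (the data frame and its counts are made up purely for illustration):

# Percent-of-total sketch; the counts below are made-up illustrative values.
crosswords <- data.frame(Weekday = c("Sun", "Mon", "Tue"),
                         Men     = c(30, 20, 25),
                         Women   = c(10, 20, 8))
crosswords$PctWomen <- crosswords$Women / (crosswords$Men + crosswords$Women)
crosswords$PctMen   <- crosswords$Men   / (crosswords$Men + crosswords$Women)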

The hardest part was trying to figure out how to make a spider chart in Tableau, since it is not offered as an option.  This required a Google search.  The code used for the "X" axis is:

CASE [Weekday]
WHEN 'Sun' THEN 0
WHEN 'Mon' THEN [% of Women] * SIN(2*PI()/7)
WHEN 'Tue' THEN [% of Women] * SIN(2*2*PI()/7)
WHEN 'Wed' THEN [% of Women] * SIN(3*2*PI()/7)
WHEN 'Thu' THEN [% of Women] * SIN(4*2*PI()/7)
WHEN 'Fri' THEN [% of Women] * SIN(5*2*PI()/7)
WHEN 'Sat' THEN [% of Women] * SIN(6*2*PI()/7)
END

This is easy to read, and the same pattern can be reused later for radar charts with more or fewer than seven (7) sides (a general version is sketched a little further below).  The "Y" axis uses the same approach:

CASE [Weekday]
WHEN 'Sun' THEN [% of Women] * COS(0)
WHEN 'Mon' THEN [% of Women] * COS(2*PI()/7)
WHEN 'Tue' THEN [% of Women] * COS(2*2*PI()/7)
WHEN 'Wed' THEN [% of Women] * COS(3*2*PI()/7)
WHEN 'Thu' THEN [% of Women] * COS(4*2*PI()/7)
WHEN 'Fri' THEN [% of Women] * COS(5*2*PI()/7)
WHEN 'Sat' THEN [% of Women] * COS(6*2*PI()/7)
END

Starting with this calculation, it was necessary to play around with the days of the week to get "Sunday" at the top of the radar chart.  The flow is clockwise, which puts "Monday" next on the right side.
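
For reference, here is the general pattern behind those coordinates, sketched in R (the function name and arguments are just for illustration):

# General radar-chart coordinates for a chart with n axes.
# k is the axis index (0 = the top axis) and r is the value plotted on that axis;
# the angle steps clockwise from the top, matching the layout described above.
radar_xy <- function(r, k, n) {
  angle <- 2 * pi * k / n
  c(x = r * sin(angle), y = r * cos(angle))
}

# Example: the value 0.5 on the fourth axis (k = 3) of a seven-axis chart
radar_xy(0.5, 3, 7)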

The second challenge was making the scale.  The shape of the polygon was copied into Adobe Photoshop, and its outline was used to make the background for the chart.  Copies were reduced to 75%, 50%, and 25% of the original size and centered on it.  This gave the correct dimensions for the percentage gridlines for the chart to follow.

Once this shape and graphic were created, the image was set as the background of the chart.  After a few trial-and-error corrections on color, it was surprising how easy and accurate it was as a background image.

The next thing was to create a "Year" filter with a button selection on the right side.  This makes the data interactive for the user.

The data sheet was duplicated three (3) times to create an overview of the change over time.  The first copy was filtered to "Year" 1993-1999 and labeled as the "1990s".  The second was filtered to "Year" 2000-2009 and labeled as the "2000s".  The third, in similar fashion, was labeled as the "2010s" for "Year" 2010-2017.

These three little time displays were placed along the bottom and set so they do not change with the interactive filter.  They show the change over the years based on an average of the percentages.

The last thing was creating the color scheme.  The idea was to look like a newspaper crossword puzzle, which led to the black monochrome palette with a yellow highlight.

NEDOC Score Project Using R

Project:

First of all, let me just say that I am not familiar with medical data, which made this project more interesting to complete, although, without certain data (such as cost variables), it was also more challenging, as I wanted to find potential solutions using only the data provided.  The general goal of this project was to find out how to reduce Emergency Department overcrowding in a particular hospital.  The specific goal was to discover whether the Emergency Department needed more beds, whether Inpatient Care needed more beds, or whether there were other factors that could help reduce overcrowding, such as staff increases.  The data set provided has three (3) sheets with information about the transfer of patients in an Emergency Department.  All the privacy information was removed from the data sets. As noted previously, there was no cost data included, so the solutions had to be devised using only the available data, limiting the scope of the project.

Idea:

The idea was to take an existing data set and apply it to the statistical programming language R, then create a visual presentation about the findings.  To do this, I would use both simple and multiple regression model building in R.  My goal is to build a model that helps identify the causes of the overcrowding using the provided data set.  Once the problem is detected, it would then be possible to develop a plan that improves the situation.

Tools:

  • R Studio
  • Microsoft Excel
  • Microsoft PowerPoint
  • Google Image Search

Data:

The data set provided was from a working hospital's Emergency Department transfer in a one-month period.  The name of the specific hospital was withheld for privacy reasons. The period given for the data set was May 2016.  The data was provided within three (3) sheets, and each sheet provided different insight regarding the overcrowding situation.

Datasets

Data Cleaning:

In order to clean the data, you first want to review it all to gain an understanding of what the various fields mean. For this project, this required some basic research into how Emergency Department overcrowding is defined. A commonly used term in the data set was the 'NEDOC Score,' so research also needed to be done to gain an understanding of this and other necessary terms. I summarize this briefly below:

What is Emergency Department Overcrowding?

Emergency Department Overcrowding, defined by Mohammad H Yarmohammadian, Fatemeh Rezaei, Abbas Haghshenas, and Nahid Tavakoli in an article written for the Journal of Research in Medical Sciences, is "the situation in which ED function is impeded primarily because of the excessive number of patients waiting to be seen, undergoing assessment and treatment, or waiting for departure comparing to the physical or staffing capacity of the ED" (2017).

To summarize for us non-medical folks, 'Emergency Department Overcrowding' is anything that causes a larger-than-normal number of patients to be waiting for Emergency Department care, including check-in, receiving care, and check-out of the Emergency Department. For example, when you go to the emergency room with a broken arm and end up waiting two hours to be seen, an hour in the doctor's 'room' to get the x-ray and cast, and another hour just to get the paperwork to go home, this is considered ED overcrowding.

With this understanding in mind, our goal for this analysis is to identify what causes overcrowding in the Emergency Department and potential ways to address the issue effectively. On a side note, overcrowding in the Emergency Department is a worldwide epidemic.  It is worth a search on Google for some interesting background knowledge if you are curious and have a few spare moments.

What does NEDOC Score Mean?

A commonly used term in the data set is the 'NEDOC Score.'  This is a foreign term for anyone without a technical background in the medical field. So back to Google we go!  (Side note: it is important to make sure that any research done through Google or other sources is peer-reviewed and reputable. For example, a Men's Health article would not be a sufficient source for this research, whereas the Journal of Community Medicine & Health Education would be a perfect resource.)

The NEDOC Score is a standardized way to measure the severity of Emergency Department overcrowding.  Essentially, the level of overcrowding is scored based on specific NEDOC criteria, giving an overall picture of the severity of the problem. Here is a great article that gives a basic understanding of the NEDOC Score and its various components.  The PDF file breaks it down into different category labels.  A NEDOC Score over 100 is classified as an Overcrowded status.

Combining Datasets

Now that we have a basic understanding of the data and the necessary terms, we can move on to cleaning the three (3) sheets.  This process was completed in Microsoft Excel, which makes the data quick and easy to prepare for R.  The goal of cleaning this particular data set was to get all of the data into one sheet.  In order to complete this task, the data on each individual sheet needs to match.  Unfortunately, the sheets use different time intervals, so it was important to adjust the data to a standard time measurement without affecting its integrity. The time intervals on the Components sheet are broken down into one (1) hour increments.  The NEDOC Score sheet is broken down into intervals of 15 minutes.  The Patient Move sheet is broken down by the time of check-in.  Since the Components sheet has the largest time intervals (one hour) and includes most of the required data, it can serve as the primary data sheet, and the other sheets can be adjusted to match it.

Starting with the NEDOC Score sheet, I needed a way to match the data to an hourly status.  This was accomplished by combining the four (4) separate fifteen (15) minute entries into one (1) single hourly entry.  To do this effectively, I used the worst-case entry among the 15-minute periods in a given hour as the representative measurement for that hour overall.  It was important to use this method in particular because taking the average of all four intervals (for example) could compromise the data and overlook an overcrowded status within that period.  The data was thus reduced to hourly intervals, with each hour assigned its highest NEDOC Score.  After this was completed, the data was copied over to the Components sheet.
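
That roll-up could also be scripted; here is a rough sketch in R, assuming the NEDOC sheet were exported with a timestamp column and a score column (both column names are hypothetical):

# Collapse 15-minute NEDOC readings into hourly worst-case values.
# "Timestamp" and "NEDOC.Score" are assumed column names, not the real headers.
nedoc <- read.csv("nedoc_scores.csv", stringsAsFactors = FALSE)
nedoc$Hour <- format(as.POSIXct(nedoc$Timestamp), "%Y-%m-%d %H:00")
hourly_nedoc <- aggregate(NEDOC.Score ~ Hour, data = nedoc, FUN = max)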

The Patient Move sheet was also combined into hourly intervals. This sheet shows the patient check-in times.  The adjustment was much simpler than for the NEDOC Score sheet: it only required counting the number of check-ins completed within each one-hour period.  I chose to include this information because it has a direct tie to Emergency Department statistics in relation to overcrowding.
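
That quick count could be scripted the same way; a sketch in R, again with a hypothetical column name:

# Count Emergency Department check-ins per hour.
# "CheckInTime" is an assumed column name on the Patient Move sheet.
moves <- read.csv("patient_moves.csv", stringsAsFactors = FALSE)
moves$Hour <- format(as.POSIXct(moves$CheckInTime), "%Y-%m-%d %H:00")
checkins_per_hour <- as.data.frame(table(Hour = moves$Hour))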

Assumptions 

It must be noted that the data provided was quite limited. As I stated above, there was no cost information. For example, what does the average hospital bed cost? What does the average hospital staff member cost? While it would be nice to have more inclusive data to help strengthen the models, as an analyst you need to work with the data you are given.  (In some cases, you can request more information from your data source or client, but this is not always possible.) To overcome this 'lack of data' problem, the analysis needs some general assumptions.

The assumptions of this data:
  • Only one person can check in to the Emergency Department at a time.
  • Patients waiting for an Inpatient Bed means that the Inpatient Care unit is full.

Once all the data was compiled into hourly intervals, a master dataset sheet was created.  This file was saved as a CSV and was ready for analysis in R Studio.

Process:

Is Overcrowding A Problem?

The first step was to discover whether an overcrowding problem even existed.  This was done by reviewing how many times a NEDOC Score over 100 was assigned.

percentageofOvercrowded <- sum(HosData$NEDOC.Score.Over.100) / length(HosData$NEDOC.Score.Over.100)

This resulted in an overcrowded status 41.8% of the time.  To find this percentage, I divided the number of hours qualifying as overcrowded by the total number of hours in the month; May 2016 had a total of 744 hours.

It is safe to say that this hospital has an overcrowding problem in its Emergency Department.

Problem Areas

The second part of the analysis is to find out which factors have a statistically significant relationship with the NEDOC score reaching over 100.  (What changes are most likely to affect or reduce the NEDOC score?) This was achieved by running linear regression models in R against the NEDOC score.

Code:
summary(lm(HosData$NEDOC.Score.Over.100~ HosData$X..of.ED.Pts.Waiting.IP.Bed))

Results:

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7765 -0.4041 -0.2946  0.5301  0.8149 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.185107   0.039851   4.645 4.02e-06 ***
HosData$X..of.ED.Pts.Waiting.IP.Bed 0.021904   0.003362   6.515 1.34e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4803 on 742 degrees of freedom
Multiple R-squared:  0.05411, Adjusted R-squared:  0.05283 
F-statistic: 42.44 on 1 and 742 DF,  p-value: 1.341e-10

Code:
summary(lm(HosData$NEDOC.Score.Over.100~ HosData$X..of.ED.Pts))

Results:
Residuals:
     Min       1Q   Median       3Q      Max 
-0.90931 -0.30983 -0.03864  0.30836  1.03273 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -0.4466515  0.0485897  -9.192   <2e-16 ***
HosData$X..of.ED.Pts  0.0142732  0.0007633  18.700   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4072 on 742 degrees of freedom
Multiple R-squared:  0.3203, Adjusted R-squared:  0.3194 
F-statistic: 349.7 on 1 and 742 DF,  p-value: < 2.2e-16

Code:
summary(lm(HosData$NEDOC.Score.Over.100~ HosData$X..of.ED.Pts + HosData$X..of.ED.Pts.Waiting.IP.Bed))

Results:
Residuals:
    Min      1Q  Median      3Q     Max 
-0.9080 -0.3118 -0.0374  0.3101  1.0313 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         -0.4482021  0.0502455  -8.920   <2e-16 ***
HosData$X..of.ED.Pts                 0.0142318  0.0008354  17.036   <2e-16 ***
HosData$X..of.ED.Pts.Waiting.IP.Bed  0.0003818  0.0031191   0.122    0.903    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4075 on 741 degrees of freedom
Multiple R-squared:  0.3203, Adjusted R-squared:  0.3185 
F-statistic: 174.6 on 2 and 741 DF,  p-value: < 2.2e-16

After running these regression models, we can see that once the total number of ED patients is included, the number of patients waiting for an Inpatient Bed is no longer statistically significant to the NEDOC score.  The number of ED patients has an extremely low p-value, so the rest of the analysis focuses on Emergency Department Beds and disregards the data regarding Inpatient Beds.

In non-medical terms, when a patient checks into the Emergency Department, they are given an Emergency Department Bed. Patients are then either checked out and released, or checked out to an Inpatient Bed in another section of the hospital, effectively leaving the Emergency Department in either case. Therefore, the number of Inpatient Beds does not greatly affect overcrowding in the Emergency Department, since that is a separate division of the hospital. The Emergency Department Beds, however, DO have a significant impact on the NEDOC score according to these regression models, allowing us to focus on that factor for the purposes of this project.

Model Building

Lastly, we need to build models that can be used to help address the problem of overcrowding.  The goal is to reduce the overall percentage of time that the NEDOC Score falls within an overcrowded status.  Since no monetary values were provided with this dataset, a range of "what-if" scenarios built from the models is the best way to explain what could happen using the data provided.

Code:
ModelFinal<-(lm(HosData$NEDOC.Score.Over.100~ HosData$EDBedsRemain + HosData$X..of.Critical.Care.Pts...display + HosData$Longest.Admit.Time.Waiting.in.ED + HosData$Count.of.Patient.Entered.in.Hour + HosData$Sunday + HosData$AfternoonShif))

intercept<- summary(ModelFinal)$coefficients[1,1]
CofRemainBeds<- summary(ModelFinal)$coefficients[2,1]
CofCritialCare<- summary(ModelFinal)$coefficients[3,1]
CofWaitTime<- summary(ModelFinal)$coefficients[4,1]
CofPatEntred<- summary(ModelFinal)$coefficients[5,1]
CofSunday<- summary(ModelFinal)$coefficients[6,1]

CofAfternoon<- summary(ModelFinal)$coefficients[7,1]

HosData$FiveNewEDBeds<- (HosData$EDBedsRemain + 5)
HosData$TenNewEDBeds<- (HosData$EDBedsRemain + 10)

HosData$FifteenNewEDBeds<- (HosData$EDBedsRemain + 15)
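
The code above stops at defining the adjusted bed counts.  As a sketch of how one of those "what-if" scenarios could be evaluated with the extracted coefficients (the 0.5 cutoff and the exact scenario logic are my assumptions, not part of the original analysis):

# Recompute the fitted overcrowding indicator as if 5 more ED beds were available,
# using the coefficients pulled out of ModelFinal above.
# The 0.5 cutoff for flagging an hour as "overcrowded" is an assumption.
HosData$PredFiveNewEDBeds <- intercept +
  CofRemainBeds  * HosData$FiveNewEDBeds +
  CofCritialCare * HosData$X..of.Critical.Care.Pts...display +
  CofWaitTime    * HosData$Longest.Admit.Time.Waiting.in.ED +
  CofPatEntred   * HosData$Count.of.Patient.Entered.in.Hour +
  CofSunday      * HosData$Sunday +
  CofAfternoon   * HosData$AfternoonShif

# Share of hours the adjusted model still flags as overcrowded
mean(HosData$PredFiveNewEDBeds > 0.5)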


Data Viz:

The visual for this analysis was best prepared as a presentation.  Theoretically, the goal would be to present these findings to a hospital executive for review.  This was completed as a Microsoft PowerPoint presentation for ease of access and use by a wider variety of viewers who may not be familiar with other platforms.  The presentation was recorded and saved as a video to be sent to the hospital stakeholders.





Insight:

The final insight was given in more detail during the presentation; however, the main findings were that Emergency Department Beds were the largest cause of the overcrowding problem and that increasing the number of beds above 72 would effectively help reduce the NEDOC Score.

Additionally, we explored the factor of available staff during busy periods in the Emergency Department. The goal was to build an understanding of when more staff would be needed to help offset the overcrowding issues.  The dataset shows that the afternoon period (4 pm to 12 am) each day, and Sundays (all day), would greatly benefit from more staff.  These time periods showed a statistically significant increase in the NEDOC Score.

Suggestions to Reduce Overcrowding:
  • Increase staff during afternoons
  • Increase staff during Sundays
  • Increase the total number of Emergency Department Beds 

Thursday, December 20, 2018

Women and Diversity in Tech

Idea:

The idea is to highlight the difference in gender diversity within the designs.  In the original design, all the bar charts are a little overwhelming and could easily be reduced to a comparison bar chart.  The decision was to focus just on gender and to make a single gender icon containing both a male and a female figure.  The image had to mimic the simple male-female icons used on bathroom signs in public places.  The images were found using Google's Image search and edited in Photoshop.  Lastly, the ethnicity breakdown of each company is also important to highlight.  A stacked bar chart the same height as the 'person' graphic is used to highlight the percentage of diversity based on ethnicity.

The colors were the hardest thing to choose.  The grey background increases the visibility of the Teal and Hot Pink that represent the genders.  The same color palette was used throughout the design.  The plot chart on the left highlights each company's percentage of male and female employees.  Anything shown in Hot Pink means more females, and anything in Teal means more males.

Data Viz:


Here is the actual Tableau workbook.  *I did NOT make this in Tableau Public, so I do not have it on a Tableau server.*

Insight: 

The biggest insight taken away from this data viz was the clustering of the percentage of male employees across organizations.  Very few organizations had a majority of female employees.  This type of information is useful to companies that want to increase diversity within their organization.  Diversity is an important necessity within an organization; it can provide new and different ways of completing a task.  It is a little disappointing that most of the organizations have not taken a more active role in hiring more female employees throughout the organization.


Project:

This was a data set for Makeover Monday on Data.world.  The project was to make over, and improve, a data viz from Diversity In Tech.  This project was based on diversity in large tech companies as compared to other large companies and the USA population.



Tools:

  • Tableau
  • Adobe Photoshop
  • Google Image Search

Data:

The data provided was downloaded from Data.world, which was originally sourced from company and press reports.  It provides information about the race and gender percentage of employees in an organization.  Most organizations included in the data set are high profile tech companies, with some other comparable organizations like US Congress and Kaiser Permanente.

Data Cleaning:

The data came pretty clean.  The cleaning process mainly involved filtering and ensuring the data was read as a percentage and not a decimal number.  This was simple to correct in Tableau by changing the format of the number: right-click on the field, select "Default Properties" and then "Number Format...", and in the menu select the percentage type.  Remember to make sure the underlying data is a decimal fraction, or whole numbers will be shown as values higher than 100%.

Process:

This Tableau project uses custom graphics to make a unique type of chart.  The biggest challenge was getting the person icon size to match the bar chart behind the person graphic icon.

The second challenge was getting the icons onto the plot chart to compare the percentage of female employees across all organizations.  The icons provided by Tableau were used to make the female icon.  This was difficult to get in the right size and location, but it did provide a great view of the clustering of the female percentage of employment.

The last part was making the ethnicity stacked bar chart.  The data was filtered to highlight just the race breakdown, using monochrome variations of Teal.  Different colors would have taken away from the main focus of gender diversity in this data viz.  The bar was annotated with the race and the percentage of employees. This was a little harder to fit in the view, but the mouse-over provides more in-depth information about the data.

Friday, December 14, 2018

Date Night Cost Around The World

Idea: 

Date night. The idea of going out and having fun with the person you enjoy spending time with most... Who doesn't look forward to date night? To be honest, after having 5 kids and running multiple businesses, we don't see many date nights at our house. So when we do get the chance to go out, we want to make sure to pick somewhere awesome! That said, I have been temporarily banned from picking restaurants, which is actually awesome for me because it prevents that loop of trying to agree on where to eat for 15 hours until we both starve and end up at the same place as last time. 

The idea for this project came from a data set for Makeover Monday on Data.world. Looking over the stacked bar chart visual, I thought it would look better (and be more fun) as an interactive infographic.  I especially wanted to provide a location view of the cities on a map, to really add some visual appeal.  While the original bar chart includes the flag, not everyone knows where these places actually are. I also wanted to break apart the cost information, since some cities might have a higher cost in one category but cheaper costs for everything else when compared to the other cities.  Lastly, I wanted to rank the cities based on cost and event.

Microsoft PowerPoint was used to make this infographic.  Microsoft PowerPoint is a very common tool used throughout the world, and most people do not know that it can do much more than make slide shows.  For example, it can make interactive graphic displays, perfect for this type of idea. My goal was to use a simple and widely-used program to show something fun and interactive.

Data Viz:

Here is a video of the graphic:



Insight:

One of the most interesting things was the ranking of New York City in comparison to the other cities.  New York City was ranked in the top five (5) in all date night events.  This makes it one of the most expensive cities in the world to go out on a date! 


Russia had the cheapest Big Mac cost of all the cities.  It was below $3.00 USD, almost half the price in the United States.  The real question to be asked is whether the quality is any good. What kind of magic ingredients do they use to get the prices so low?


Zurich had four (4) of the highest cost events when compared to the other cities.  It ranked the highest in Movies, Taxis, 2 drinks, and Big Mac.  It would be smart when visiting this city to find places within walking distance that are not fast food or cocktails.  Maybe a small restaurant with local charm? Definitely bring some extra cash to Zurich!


It was surprising to see that Hong Kong was about 75% more costly than Mexico City.  Mexico City ranked lowest of all the cities in most of the date night events.  While these are arguably on opposite ends of the cost-of-living spectrum, a difference of 75% still seems quite drastic for big cities. Mexico City's highest rank was #11 for the Big Mac.  The total night out in Mexico City was just over $100.00 (USD). You're basically guaranteed a fun night without breaking the bank. Awesome!


Project:

This was a data set for Makeover Monday on Data.world.  The project was to make over, and improve, a data viz from Statista.  This project was based on the cost of a night out in different cities around the world.

The original data viz:

Tools:

  • Microsoft Excel
  • Microsoft PowerPoint
  • Google Image Search
  • Microsoft Paint

Data:

The data originates from UBS and is available as a dataset on Data.world.  The data includes 13 cities from around the world.  The category information has seven (7) variables with a monetary cost displayed in United States Dollars ($USD).  The data breaks down further into two types of night-out events: party night and date night.

Data Cleaning:

The data cleaning was extremely easy and simple to complete for this type of data viz.  The cleaning was completed in Microsoft Excel.  The main cleaning step was removing the split into the two categories of date night and party night, which was not needed for the goal of the visual graphic.

Combining dinner and the two (2) drinks at dinner into a single value helped consolidate the data.  This value was called "Dinner" and was created by simply adding the two cells together:

Dinner = Drinks at Dinner + Dinner

Process:

The biggest part of creating this data viz was finding the right graphics for the build-up of the layout.  A world map SVG file with a bright green and aqua color scheme gave it the perfect visual effect.  The cartoonish-looking map was simple enough to provide a fun but clear way to display the cities around the world.  The SVG file was found on Google Image search, using the 'labeled for reuse' filter under Tools.  A few edits were made in Microsoft Paint.

The second challenge was finding flags of the countries.  Originally, I wanted to have push pins in the map with the flag attached at the end.  This looked over-complicated and was simplified to just the flag at the city location.  Each city was located by visually matching up the cartoon map location with its corresponding location on Google Maps.  While this was not exact, it gives a fairly decent representation of location.

The last thing I needed for the display was icons to represent each type of cost.  Essentially, I wanted to find simple SVG cartoon graphics in an icon style to represent each event of the night.  These were found using Google Image search, looking for PNG files without backgrounds and using the same filter options as before.

With all the images gathered, it was time to build the data viz.  The layout was constructed to display:
  • The location on the map
  • The cost of each night event
  • The total cost of the night
  • The ranking as compared to the other cities in the data set
The breakdown of each cost was placed on the left side in an arch type of display, with the total cost of the date night in the center of the arch.  The bottom of the graphic shows the ranking of each cost compared to the other cities, ranked from #1 to #13.  (This was done per night event type.) It was possible for a city to hold the same ranking in two categories. For example, Zurich ranked #1 in the Big Mac, 2 Drinks, Taxi, and Movies categories. The highest-ranking category for each city was highlighted in yellow.

Lastly, I wanted the map flag to become clickable to send the side view to the specific city data when the corresponding flag was selected.  This was completed using Microsoft PowerPoint.  The process was relatively simple. I created an anchor point on each flag location, which jumps the viewer to the slide for the corresponding city's data.  It was necessary to do this on each slide, so that all slides were inter-connected. This allowed for the interactive change of data during the presentation view of PowerPoint.  When looking at the presentation, it is viewed as one single page, because only the data on the sides and bottoms change, giving the appearance of an interactive visual.


Microsoft PowerPoint is great for saving your work as a graphics file.  You can save your work as a video, a presentation, a PDF collection, or even a JPEG of the data viz infographic, simply by changing the file type under "Save As...".

Thursday, December 6, 2018

How Much of the USA Population Can Your State Feed?

Idea:

After reviewing the original data viz, I thought it would be better to explain the data in terms of the food that could be produced by each state.  I felt that adding each state's food-production potential would give the viewer a size reference and a concrete outcome from the dataset.

Data Viz:

Insight:

The insight taken from this data was how much of the US population could be fed if a state dedicated its land just to growing food.  It was very surprising how many years' worth of food could be produced.  Most states could grow a crop like corn or wheat and provide enough to feed the US population for more than a year.
Feed America Insight

Project:

This was a data set for Makeover Monday on Data.world.  The project was to make over, and improve, a data viz from Our World in Data.  This project was based on the food types and the land area required to make 1 gram of protein.


Tools:

  • Tableau Public

Data:

The data given was from Our World in Data.  It came with 10 food type categories and the amount of land needed to produce 1 gram of protein: a very simple and small dataset.

In order to add the state size data, I grabbed data from the US Census Bureau website on state land area.  The data came with a choice of total area including water or land area only.  Since the data is about growing food, the land-only figures were chosen.

The last piece was the population of the United States.  Data from Google, which came from the Census Bureau, gave a total of 325.7 million people in the United States.

Data Cleaning:

The cleaning of the data was extremely easy.  The data on the amount of land used for food was already clean: it had the type of food and the amount of land needed in square meters (m²).

The state sizes needed some simple cleaning.  Territories that are not considered states were removed from the set.  The District of Columbia (Washington, D.C.) remained in the data set.

The data was imported into Tableau Public with a left join between the state size and the food data sets.  In order to join them, a "Year" field was added to the "State size" data set.  This allowed the "Join" in Tableau Public by matching the "Year" in both data sets.  The year 2017 was used for all of the data.

The other part of cleaning was the conversion of square miles into square meters.  This was handled using the following formula:

Meters² = Miles² × 2,590,000

This formula was added in Tableau as a "Calculated Field..." on the dataset.

With the food values expressed in grams of protein, the next step was to find out how much protein the average person needs per day: roughly 56 grams for men and 46 grams for women.  A value of 20,000 grams of protein per person per year was used, which made the math easier and falls between the yearly totals for men and women.

Taking the protein a person needs over a full year makes it possible to estimate how much of the US population a state could feed for a year.

Grams of Protein a State Can Produce = State Size in Meters² ÷ Amount of Land Needed per Gram of Protein

% of US Population That Can Be Fed = (Grams of Protein a State Can Produce ÷ 20,000 Grams of Protein per Person per Year) ÷ 325,700,000
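
To make the arithmetic concrete, here is the same calculation sketched in R (the example inputs are hypothetical, not values from the dataset):

# Feeding-capacity arithmetic from the formulas above.
# The example state size and land-per-gram figures are hypothetical.
sq_mile_to_sq_m      <- 2590000      # 1 square mile ≈ 2.59 million m²
protein_per_person_y <- 20000        # grams of protein per person per year
us_population        <- 325700000

share_fed <- function(state_sq_miles, land_m2_per_gram) {
  state_m2   <- state_sq_miles * sq_mile_to_sq_m
  grams      <- state_m2 / land_m2_per_gram            # protein the state could grow
  people_fed <- grams / protein_per_person_y
  people_fed / us_population                           # fraction of the US population
}

# Example: a 50,000-square-mile state growing a crop that needs 0.01 m² per gram
share_fed(50000, 0.01)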

Process:

The layout of the data viz shows the state shape and name on the left side.  The right side shows the food types with graphic representations.  The values displayed are the percentage of the US population that could be fed if all the land in the state were used to grow that food.

The original bar chart is located at the bottom of the visual to display the massive difference in land used for raising animals compared to crops grown in the ground.

The top has the state selector.  A drop-down was used for aesthetic reasons: the original idea was a slider, but the slider interface made it difficult to find a particular state, while the drop-down menu handles this easily.
