Analyzing the Game of Survivor -- Collecting The Data (1)¶
The Data of Survivor¶
Recently, I have been interested in the CBS reality/game show Survivor. Part of this is because of the extra time I have had due to the circumstances surrounding COVID-19, and part of it is because the show greatly resembles other games that I enjoy playing (social deduction games, etc.). If you have never watched it, I would highly recommend it! I'm not normally a fan of reality TV, but this show combines the elements of reality TV and a game show in a way that makes it enthralling.
After watching a few seasons, and being a data scientist, I became interested in some of the results of the show. It seems that certain personalities are more or less likely to win, or to be voted out at any given point in the game. Additionally, there seems to be a balancing act between the importance of challenges, strategy, and the social game.
This is the beginning of a series of posts relating to Survivor: the data that can be gathered from it, the analyses we can perform with some of the sources, and perhaps ending in a few machine learning or statistical learning applications (to try to predict the season winner, or who will be voted out on which day, etc.)!
A quick primer on Survivor¶
To introduce some of the terms and information I will be using throughout this post, here is a brief summary. From Wikipedia:
> [Survivor] features a group of contestants deliberately marooned in an isolated location, where they must provide food, water, fire, and shelter for themselves. The contestants compete in challenges for rewards and immunity from elimination. The contestants are progressively eliminated from the game as they are voted out by their fellow contestants until only one remains to be awarded the grand prize and named the "Sole Survivor".
In Survivor, there are a few things of note:

- Survivors must survive in the elements with one another as a tribe. Much of the game focuses on securing basic necessities. However, despite the name, it is hardly a game about wilderness survival; the majority of the game is social and challenge-based.
- Every episode, there is one elimination from the game. Contestants vote players off by majority vote.
- At the beginning of the season, there are two or more tribes. These groups live separately from one another and compete in challenges collectively.
- Reward challenges give contestants (called castaways by the show) a chance to win luxury prizes, like food, survival gear, or other items of interest. As the game progresses, the rewards usually get more and more enticing.
- Immunity challenges provide a strategic advantage in the game. When there are two (or more) tribes, winning an immunity challenge protects the whole tribe from elimination -- someone from the losing tribe must leave. When there is one tribe, immunity challenges are individual, and winning prevents that person from being voted out.
- The game changes substantially when the two (or more) tribes are merged into one, usually around halfway through the game. This is known as the merge, and from that point on it is an individual game.
- Players tend to work in groups (or alliances) to achieve goals in the game, acting as voting blocs.
- The game of Survivor is edited after the whole season is filmed -- meaning the people editing often know who the winner will be. This is important to keep in mind.
- Additionally, each episode contains some number of confessionals -- segments where contestants speak to the camera away from the other contestants. They may complain about a particular contestant, for instance, or discuss other parts of the game.
While there is more to the show than just this (and it has evolved quite a bit over time), for our purposes it is enough to understand these basic ideas of the game before jumping in. If you're curious, Season 7 (Pearl Islands) is one of my favorite seasons and is a great place to start!
The ETL Process¶
The first step in this process, of course, is getting the actual data to use. To this end, I have gone quite far in collecting data and creating a workflow to update it as new seasons air. This behind-the-scenes work is more data engineering than data science, but it is probably the most important part of any of the analysis I will be doing!
ETL stands for Extract, Transform, Load, which are the three steps to building a data pipeline. This general structure can describe nearly any process using data. As data scientists, we tend to spend most of our time on the Transform part of this equation.
You may not realize it, but almost any data analysis work can be abstracted to fit into these three broad categories. In many toy examples, both on this site and elsewhere, the extract portion is limited to loading an Excel or CSV file before fitting the model. However, in real life (and in any job as a data scientist!) the method by which we gain access to the data, and the form it is in, is of crucial importance.
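To make this concrete, here is a minimal, entirely hypothetical sketch of such a pipeline in Python. The source file, table name, and cleaning steps are stand-ins for illustration, not my actual code:

```python
import pandas as pd


def extract() -> pd.DataFrame:
    # Extract: pull raw data from a source -- an API, a spreadsheet, a webpage
    return pd.read_csv('some_source.csv')  # hypothetical source file


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: standardize column names, fix types, derive metrics
    raw.columns = [c.strip().lower().replace(' ', '_') for c in raw.columns]
    return raw.dropna(how='all')


def load(clean: pd.DataFrame, con) -> None:
    # Load: persist the cleaned frame into the database
    clean.to_sql('some_table', con=con, schema='survivor',
                 if_exists='append', index=False)


# The whole pipeline is then just a composition of the three steps:
# load(transform(extract()), con=eng)
```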
If you don't think this is part of data science -- think again! You may have heard the line that more than 80% of a data scientist's time is spent cleaning data. While the reliability of that figure is somewhat questionable (check out this survey for more recent results from 2018), cleaning is without a doubt one of the most time consuming and important parts of a data scientist's job.
Without good, clean, and easily accessible data, all of the analytical muscle and machine learning techniques you may use will be to no avail. As the old adage goes -- garbage in, garbage out.
Without further ado, I want to detail some of the goals of this project:
- Find reliable data sources covering as many elements of the game of Survivor as I can
- Create reproducible code to pull this information moving forward (...Extract...)
- Connect different data sources together, generate metrics based on them, and clean the data from these sources (...Transform...)
- Load this information into a relational database (...Load...)
- Update this information on a daily basis (as new seasons come out, etc.)
To this end, I have written a series of scripts to generalize this process. I then use Apache Airflow and a Postgres database to periodically run the code and store the results. This all lives on my homeserver, and the Airflow tasks run asynchronously across 5 Raspberry Pis. In the future, I may detail this infrastructure work (which is outside the realm of data science, but may be of interest nonetheless). A rough sketch of what one of these scheduled jobs looks like is below.
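My actual DAGs aren't shown here, but a daily Airflow job looks roughly like this. The DAG and task names are illustrative only (and note that on older Airflow 1.10 installs the operator import path is `airflow.operators.python_operator` instead):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_and_store():
    """Placeholder: run extract/transform/load for one data source."""


with DAG(
    dag_id='survivor_etl',            # hypothetical DAG name
    start_date=datetime(2020, 7, 1),
    schedule_interval='@daily',       # re-pull every data source once a day
    catchup=False,
) as dag:
    PythonOperator(task_id='pull_and_store', python_callable=pull_and_store)
```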
This whole process took me roughly a month. It was well worth it, though -- the data I have collected is rich and comes from many different sources, including:
- True Dorks data with statistics on challenges (Excel spreadsheets)
- This collection of confessional data from various seasons, with every confessional listed for each episode (Word documents)
- The Survivor Wiki for information about episodes, seasons, contestants, tribes and alliances. (HTML, webpages)
- Pushshift.io data for the reddit r/survivor subreddit, where people discuss all things related to Survivor (API)
- Caunce Character Types for each of the different contestants on the show (Excel)
As you can imagine, combining all of these data sources to tell a cohesive story was quite a challenge, but a worthwhile one all the same!
The remainder of this post will discuss the data I have collected, how it has been saved, and what I plan to do with it!
Querying the Database¶
I have stored all of the results in a Postgres database, hosted on my homeserver. Some of this information should be kept secret, like my password, of course, so I will access it using environment variables. These were actually saved in the conda environment I have specified for this project, `survivor_scraping` (conda can store per-environment variables with `conda env config vars set`).
```python
import os

from sqlalchemy import create_engine
import pandas as pd

# Pull the connection details from environment variables
pg_un, pg_pw, pg_ip, pg_port = [os.getenv(x) for x in ['PG_UN', 'PG_PW', 'PG_IP', 'PG_PORT']]


def pg_uri(un, pw, ip, port):
    return f'postgresql://{un}:{pw}@{ip}:{port}'


eng = create_engine(pg_uri(pg_un, pg_pw, pg_ip, pg_port))
```
I will now use this engine, as well as the handy `read_sql` function of pandas, to read in some of the tables and show some of the interesting data we have here!
True Dorks Data¶
For the True Dorks data, we have a few different tables that were extracted from here. The first is the `episode_performance_stats` table, which details the results from the challenges and other events over the course of the entire episode.
```python
pd.options.display.max_columns = 100
pd.read_sql('SELECT * FROM survivor.episode_performance_stats LIMIT 100', con=eng)
```
Here, you can see some interesting information. Notice that the rows are keyed by `contestant_id`s, `episode_id`s and `season_id`s. These are foreign keys into three other tables, `contestant`, `episode` and `season`, whose information was pulled from the wiki.
In order to map these to the correct values, there was a semi-automatic process to find the correct contestant, episode and season based on their names. This is why the Transform portion of the ETL process was necessary -- to make sure all of these tables were consistent. While it was a lot of work, it has made the data much richer than any of these individual sources alone. From here on out, you can assume that `id` columns required matching with other data sources to ensure correctness and to conform with relational database paradigms.
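As an illustration of what "semi-automatic" means here, a first pass can be done with simple fuzzy string matching, flagging low-confidence cases for manual review. This is only a sketch of the idea (with made-up inputs), not my actual matching code:

```python
import difflib


def match_name(name, known_names, cutoff=0.8):
    """Return the closest known name, or None so a human can review it."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A misspelling in one source still resolves to the canonical wiki name
match_name('Rob Marianno', ['Rob Mariano', 'Rob Cesternino'])  # -> 'Rob Mariano'
```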
Additionally, each table has a `created` and an `updated` column. These are present in all of the tables, and were most recently updated on July 19th. If the automated process (using Airflow and the Raspberry Pis) updates these tables, we will see the timestamp at which this happened. The load step uses an upsert method, so re-running it updates existing rows in place rather than duplicating them or clobbering values that are already current.
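For those unfamiliar, in Postgres an upsert is an `INSERT ... ON CONFLICT` statement. A simplified version of what a load step might execute looks like this; the table and column names are illustrative, not my exact schema:

```python
from sqlalchemy import text

upsert = text("""
    INSERT INTO survivor.episode (episode_id, title, updated)
    VALUES (:episode_id, :title, now())
    ON CONFLICT (episode_id) DO UPDATE
        SET title   = EXCLUDED.title,
            updated = now()
    -- the created column is deliberately left untouched on conflict
""")

with eng.begin() as con:
    con.execute(upsert, {'episode_id': 1, 'title': 'The Marooning'})
```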
You can see this table details information at the episode level. For instance, if an episode had both a reward challenge and an immunity challenge, this table will have the results from both. Some episodes have multiple of each (especially toward the end), so this is important to take note of. Challenge wins are fractional when they are won as part of a tribe -- for instance, if I am in a tribe of 4 and my tribe wins a challenge, I get 0.25 challenge wins. The same goes for challenge appearances.
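In other words, the credit for a tribe win is split evenly among its members. As a trivial illustration (with hypothetical names):

```python
tribe = ['Alice', 'Bob', 'Carol', 'Dave']         # hypothetical 4-person tribe
credit = {name: 1 / len(tribe) for name in tribe}
# {'Alice': 0.25, 'Bob': 0.25, 'Carol': 0.25, 'Dave': 0.25}
```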
The remainder of the results here should be relatively self explanatory. More information is listed on the True Dorks site. Thanks, True Dorks!
Additional tables from this source are the `immunity_challenge`, `reward_challenge` and `vote` tables.
```python
pd.read_sql('SELECT * FROM survivor.immunity_challenge LIMIT 100', con=eng)
pd.read_sql('SELECT * FROM survivor.reward_challenge LIMIT 100', con=eng)
```
The two tables above are pretty self explanatory -- they record each contestant's result for each particular challenge in an episode.
```python
pd.read_sql('SELECT * FROM survivor.vote LIMIT 100', con=eng)
```
The above table, also from True Dorks, lists each vote in each episode. Note that the `contestant` and `voted_for_id` columns both refer to values in the `contestant_season` table.
Confessionals Data¶
One of the more interesting data sources here: the confessionals data contains the actual words said in confessionals by contestants in a select number of seasons. This text could make for some very interesting NLP work, which I plan to explore in future articles. It was sourced from the Google Drive here, and was scraped based on the names in the Word documents.
In some cases, as you'll notice below, there is nothing in the `content` field. In these cases, it was recorded that a particular `contestant_id` had spoken, but not what they had said. While this is a bit disappointing, the counts per episode or per contestant can still make for an interesting analysis.
```python
pd.read_sql('SELECT * FROM survivor.confessional LIMIT 1', con=eng)
```
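For example, even without the text itself, counting confessionals per contestant per episode is a quick groupby away. A sketch, assuming only the id columns shown above:

```python
confessionals = pd.read_sql(
    'SELECT episode_id, contestant_id FROM survivor.confessional', con=eng)

counts = (confessionals
          .groupby(['episode_id', 'contestant_id'])
          .size()                        # one row per confessional
          .rename('n_confessionals')
          .reset_index())
```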
Survivor Wikia¶
The information scraped from the Survivor wiki is the following:
- Episodes
- Seasons
- Contestants
- Tribes
- Alliances
- Final Words
- Story Quotes
- Voting Confessionals
Hopefully, this rich data collected by the people at that wiki can be put to good use! In the process, I was also able to identify small errors and push edits up to the wiki. Hopefully, I'll be able to find even more!
Seasons, episodes and contestants are the building blocks of what makes Survivor Survivor. When pulling from the wiki I tended toward a "more is better" mentality, so data from summary and story sections was included as well.
```python
pd.read_sql('SELECT * FROM survivor.episode LIMIT 10', con=eng)
pd.read_sql('SELECT * FROM survivor.season LIMIT 10', con=eng)
```
Contestants were broken into two tables, to separate the idea of a contestant in a particular season from the contestant overall. Some recurring players act very differently from season to season, so it may be useful to keep this information separate. Joining the two back together is straightforward, as sketched below.
```python
pd.read_sql('SELECT * FROM survivor.contestant LIMIT 10', con=eng)
pd.read_sql('SELECT * FROM survivor.contestant_season LIMIT 10', con=eng)
```
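Here is the kind of join I mean. Note that I am assuming `contestant_id` is the shared key, based on the id conventions described earlier:

```python
pd.read_sql("""
    SELECT c.*, cs.season_id
    FROM survivor.contestant c
    JOIN survivor.contestant_season cs USING (contestant_id)
    LIMIT 10
""", con=eng)
```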
Something else that may be interesting to look at is tribes and alliances. Who is allied with whom? How strong are those ties? This could be interesting to examine both historically and for future seasons.
```python
pd.read_sql('SELECT * FROM survivor.tribe LIMIT 10', con=eng)
pd.read_sql('SELECT * FROM survivor.alliance LIMIT 10', con=eng)
```
Finally, to complement the confessionals from above, some episodes have a wealth of quotes from contestants in particular situations.
The `final_words` table contains the final words spoken by eliminated contestants at the end of each episode:
```python
pd.read_sql('SELECT * FROM survivor.final_words LIMIT 10', con=eng)
```
Voting Confessionals contain words spoken in the voting area each episode (often censored or redacted for dramatic suspense):
```python
pd.read_sql('SELECT * FROM survivor.voting_confessional LIMIT 10', con=eng)
```
Finally, story quotes contain quotes from the entire episode. There may be overlap with the confessionals section, or the quotes may come from within the tribes (not considered confessionals). Additionally, when no contestant is attached, a quote may represent a voiceover or something spoken by someone outside of the main cast.
```python
pd.read_sql('SELECT * FROM survivor.story_quotes LIMIT 10', con=eng)
```
NOTE: You may notice that there is a wealth of information here from the wiki itself. While some of it hasn't been cleaned entirely, the text fields have so far been left mostly untouched. What has been cleaned are the numeric and datetime columns, and the relational foreign keys. Until a good use is found for some of the text data, it doesn't make sense to clean it yet; I can see there being a lot of information in there that I don't want to slice away just yet.
Pushshift.io¶
Pushshift.io is a public API for requesting archived Reddit data. The nice thing is that, unlike Reddit's own API, it does not sharply limit how much of the history you can access, nor does it cap total requests. This data is updated every day, and contains all of the posts since the inception of the r/survivor subreddit in 2011. r/survivor is actually a thriving community, and a rich data source for engagement with, as well as predictions about, the show. We have two tables here:
- Submissions contains data on the submissions (topics) posted to the subreddit.
- Comments contains data on the comments in the subreddit.
```python
pd.read_sql('SELECT * FROM survivor.reddit_submissions LIMIT 10', con=eng)
pd.read_sql('SELECT * FROM survivor.reddit_comments LIMIT 10', con=eng)
```
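For reference, a raw request against Pushshift's public endpoint looks roughly like this; a production scraper like mine would also have to paginate (e.g. with Pushshift's `before` parameter):

```python
import requests

resp = requests.get(
    'https://api.pushshift.io/reddit/search/submission/',
    params={'subreddit': 'survivor', 'size': 100, 'sort': 'desc'},
)
submissions = resp.json()['data']   # a list of dicts, one per submission
```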
There is a wealth of information that Pushshift.io provides here. In terms of connecting to the other tables, we have `most_recent_season`, `within_season` and `most_recent_episode` columns, which relate the date of each post (or comment) to the most recent season/episode in our tables. This will be the subject of our first analysis!
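If you wanted to derive a column like `most_recent_episode` yourself, `pandas.merge_asof` is one way to do it. This sketch assumes the post timestamps and episode air dates are available as comparable datetime columns; the column names here are guesses, not my actual schema:

```python
posts = pd.read_sql('SELECT * FROM survivor.reddit_submissions', con=eng)
episodes = pd.read_sql('SELECT episode_id, airdate FROM survivor.episode', con=eng)

# merge_asof requires both frames to be sorted on their join keys
posts = posts.sort_values('post_date')
episodes = episodes.sort_values('airdate')

# attach the most recent episode that aired on or before each post
joined = pd.merge_asof(posts, episodes,
                       left_on='post_date', right_on='airdate',
                       direction='backward')
```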
Caunce Character Types¶
Last, and certainly not least, are the Caunce Character Types. These are character types assigned by Angie Caunce of the popular podcast Rob Has a Podcast. I actually don't know too much about them, but they are talked about quite a bit on the subreddit and other forums. Each contestant is linked to the character type to which Angie has assigned them. These have ids, like the other tables, and look like this:
```python
pd.read_sql('SELECT * FROM survivor.role LIMIT 10', con=eng)
```
Okay, cool. So What?¶
I totally understand you asking yourself this question, if you've made it this far. So what? Who cares about all of this data?
Well, I wanted to make a post about this because, for the next few months (and possibly beyond...), I will be posting a weekly analysis of Survivor based on these tables. And since they can certainly be confusing, I figured the best place to start is where all analysis starts -- data collection and cleaning. And while the next few analyses may be more fun to dig into, they will be far less time consuming, and ultimately far less important, than this step.
One of the coolest parts about all of this is that the generation of these tables is now in production in my little homeserver environment. My Raspberry Pis will be put to great use scraping and otherwise collecting data from all of these sources every day. Even better, I don't even have to think about it anymore!
As a data scientist, this is a beautiful situation -- I am able to be entirely self-sufficient in doing all of the cool data science stuff that I want! :)
Also, I'm a huge nerd, and if you are too, I hope seeing all of this data makes you drool! If you would like me to share this data with you, please feel free to reach out!