Learning to be a code alchemist, one experiment at a time.

Predicting Passenger Survival on the Titanic: Part 1


The data set provided by Kaggle is about one third of the available data, supplied as a .csv file.

Each row represents a passenger on the Titanic, along with some information about them. Let's take a look at the columns:

PassengerId – A numerical id assigned to each passenger.
Survived – Whether the passenger survived (1) or didn't (0). We'll be making predictions for this column.
Pclass – The class the passenger was in: first class (1), second class (2), or third class (3).
Name – The name of the passenger.
Sex – The gender of the passenger: male or female.
Age – The age of the passenger. Can be fractional.
SibSp – The number of siblings and spouses the passenger had on board.
Parch – The number of parents and children the passenger had on board.
Ticket – The ticket number of the passenger.
Fare – How much the passenger paid for the ticket.
Cabin – Which cabin the passenger was in.
Embarked – Where the passenger boarded the Titanic.

Step 1) Think

What might affect survival? Age – children and grandparents? Sex – women boarded lifeboats first? Passenger class – first class is closer to the deck? Fare – pay more, get a higher passenger class? Number of family members – more people to help, or more people to think about saving? Embarked, Ticket, and Name are less clear, but may have an effect.

Step 2) Check Out the Data

import pandas
titanic = pandas.read_csv("titanic_train.csv")
print(titanic.describe())

The .describe() method only summarizes numerical columns; non-numerical data can't be fed into an algorithm! We'll fix this in Step 4.

Step 3) Clean the Data

The Age column has 714 entries, compared to 891 for the rest of the columns. This means some passengers are missing ages. There are many strategies for cleaning up missing data, but a simple one is to just fill in all the missing values with the median of all the values in the column. We can select a single column by indexing the dataframe like a dictionary, which gives us a pandas Series:

titanic["Age"]

The fillna method will fill in any missing spots:

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

Step 4) Converting Non-Numeric Data

Sex is binary – either male or female. The .loc finds all the instances in the titanic["Sex"] column that match "male" and assigns those to 0, and "female" to 1:

titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

Embarked is either S, C, Q, or NaN (missing). We need to clean the dataset before converting these to the digits 0, 1, and 2. Since S was the most common embarkation port, we will assume everybody with a missing value got on at S.

titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

Perhaps name length has some value? Longer names are generally more aristocratic, pertaining to wealth! The .apply method creates a new feature (column); the lambda x: len(x) applies len() to each name in the column.

titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

Maybe title? Perhaps more education, or being of the cloth, or being a professional affected one's chances? This is harder because the title is inside the Name column, meaning we have to find it, then create a list of all available titles…

import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.
    # Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

Get all the titles and print how often each one occurs.

titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))

Map each title to an integer. Some titles are very rare, and are compressed into the same codes as other titles.

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v
titanic["Title"] = titles

The last one is family size – maybe having a larger family helped?
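One thing to note: the FamilySize column used below never actually gets created in this walkthrough. A reasonable definition, and the one the rest of the code appears to assume, is simply siblings/spouses plus parents/children aboard:

# Assumed definition: total relatives aboard (SibSp + Parch).
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]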

import operator

A dictionary mapping family name to id

family_id_mapping = {}

A function to get the id given a row

def get_family_id(row):
    # Find the last name by splitting on a comma.
    last_name = row["Name"].split(",")[0]
    # Create the family id.
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping.
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it if we don't have an id.
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

Get the family ids with the apply method

family_ids = titanic.apply(get_family_id, axis=1)

There are a lot of family ids, so we’ll compress all of the families under 3 members into one code.

family_ids[titanic["FamilySize"] < 3] = -1

Print the count of each unique id.

print(pandas.value_counts(family_ids))
titanic["FamilyId"] = family_ids

Step 5) Feature Selection

There's a huge number of features we could use; a good way to choose the ones we want is univariate feature selection. This essentially goes column by column and figures out which columns correlate most closely with what we're trying to predict.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

Perform feature selection

selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

Get the raw p-values for each feature, and transform from p-values into scores

scores = -np.log10(selector.pvalues_)

Plot the scores. See how “Pclass”, “Sex”, “Title”, and “Fare” are the best?

plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()
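If you'd rather not read the winners off the plot, SelectKBest can also report them directly; here is a small optional snippet using its get_support method:

# List the k best features programmatically instead of eyeballing the plot.
best_features = [predictors[i] for i in selector.get_support(indices=True)]
print(best_features)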

Step 6) Cross Validation

Essentially, cross validation is splitting the dataset into parts, training the algorithm on some of them, and testing it on the others. This helps prevent "overfitting", whereby the model fits itself to the quirks of the dataset rather than to the full population. To cross validate, you split your data into some number of parts (or "folds"). Let's use 3 as an example. You then do this:

Combine the first two parts, train a model, make predictions on the third.

Combine the first and third parts, train a model, make predictions on the second.

Combine the second and third parts, train a model, make predictions on the first.

Scikit-learn contains a cross validation helper as well! It generates cross validation folds for the Titanic dataset and returns the row indices corresponding to train and test. We set random_state to ensure we get the same splits every time we run this.

from sklearn.cross_validation import KFold
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)
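As a quick sanity check (the real model training comes in the next part), you can iterate over kf to see the three train/test splits it produces; this sketch only prints the fold sizes:

for train_indices, test_indices in kf:
    # Each iteration yields row indices: two folds for training, one held out for testing.
    print(len(train_indices), len(test_indices))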

Now that we've got the data cleaned up, our final choice of features selected, and cross validation in place, we'll apply some machine learning algorithms and see how accurate a prediction we can get using various models!

RAID Setups Explained


RAID : Redundant Array of Independent Disks

RAID is a data storage virtualization technology that combines multiple physical disk drive components into a single logical unit for the purposes of data redundancy, performance improvement, or both.

Data is distributed across the drives in one of several ways, referred to as RAID levels, depending on the required level of redundancy and performance. The different layouts are named by the word RAID followed by a number, for example RAID 0 or RAID 1. Each RAID level provides a different balance among the key goals: reliability, availability, performance, and capacity. RAID levels greater than RAID 0 provide protection against unrecoverable sector read errors, as well as against failures of whole physical drives.

Some important terms to know (in relation to RAID setups) are:

Striping: the technique of segmenting logically sequential data, such as a file, so that consecutive segments are stored on different physical storage devices.

Disk Mirroring: The replication of logical disk volumes onto separate physical hard disks in real time to ensure continuous availability. E.g., if disk B is a mirror of disk A, all data on disk A is copied in real time to disk B.

Parity: I'm not yet too sure how this works in relation to RAID, but parity is just simple XORed data.
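To make the XOR idea concrete, here is a minimal Python sketch (a toy example, not tied to any particular RAID implementation) showing how a parity block lets you rebuild a lost data block; RAID 5 spreads exactly this kind of parity across its disks:

# Two data blocks (pretend each lives on its own disk).
block_a = bytes([0b10110010, 0b01010101])
block_b = bytes([0b11001100, 0b00001111])

# Parity is the byte-wise XOR of the data blocks (stored on a third disk).
parity = bytes(a ^ b for a, b in zip(block_a, block_b))

# If the disk holding block_b dies, XOR the survivors to rebuild it.
recovered_b = bytes(a ^ p for a, p in zip(block_a, parity))
assert recovered_b == block_b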


RAID 0: Requires a minimum of 2 disks; it has excellent performance because it's striped, but there's no redundancy; don't use this for any critical system.


RAID 1: Requires a minimum of 2 disks; it has good performance because there is no parity or striping; it has excellent redundancy since blocks are mirrored.


RAID 5: Requires a minimum of 3 disks; it has good performance (striped) and good redundancy (distributed parity); the best cost-effective option providing both performance and redundancy. Use this for a database that is heavily read-oriented; write operations will be slow.


RAID 10/1+0: Requires a minimum of 4 disks; called a "stripe of mirrors"; excellent redundancy (mirrored), excellent performance (striped). If affordable, the BEST option for any mission-critical applications (especially databases).
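As a rough way to compare the trade-offs above, this little sketch (illustrative numbers only) computes the usable capacity of each level for a set of identical disks:

def usable_capacity(level, n_disks, disk_size_tb):
    # Standard capacity rules: RAID 0 uses everything, RAID 1 keeps one copy's worth,
    # RAID 5 loses one disk to parity, RAID 10 loses half to mirroring.
    if level == "RAID 0":
        return n_disks * disk_size_tb
    if level == "RAID 1":
        return disk_size_tb
    if level == "RAID 5":
        return (n_disks - 1) * disk_size_tb
    if level == "RAID 10":
        return (n_disks // 2) * disk_size_tb
    raise ValueError("unknown level")

for level, n in [("RAID 0", 2), ("RAID 1", 2), ("RAID 5", 3), ("RAID 10", 4)]:
    print(level, usable_capacity(level, n, disk_size_tb=2), "TB usable")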

Intro to Hadoop, Yarn, and HDFS


Hadoop is an open source framework that you can install on a cluster of machines so they can communicate and work together to store and process large amounts of data.

YARN is a Resource Manager. It tracks the availability of live nodes and resources and coordinates which tasks from which clients get what resources when.

Hadoop uses HDFS, the Hadoop Distributed File System. It uses a master/slave architecture where a single NameNode manages the file system metadata, and one or more slave DataNodes actually store the data. A file is split into blocks, and these blocks are stored across a set of DataNodes. The NameNode maps where the data lives; the DataNodes handle reads, writes, block creation, deletion, and replication.
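As a toy illustration of that split (pure Python, with made-up block size and node names; real HDFS uses much larger blocks and smarter placement), here is roughly what the NameNode/DataNode bookkeeping looks like:

import itertools

def split_into_blocks(data, block_size):
    # HDFS-style: chop the file into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

datanodes = {"dn1": [], "dn2": [], "dn3": [], "dn4": []}  # hypothetical DataNodes
namenode_metadata = {}  # block id -> DataNodes holding a replica

file_contents = b"x" * 1000  # pretend file
blocks = split_into_blocks(file_contents, block_size=256)
replication = 3
node_cycle = itertools.cycle(datanodes)

for block_id, block in enumerate(blocks):
    replicas = [next(node_cycle) for _ in range(replication)]
    namenode_metadata[block_id] = replicas          # the NameNode only tracks the mapping
    for node in replicas:
        datanodes[node].append((block_id, block))   # the DataNodes store the actual bytes

print(namenode_metadata)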

How does Hadoop work?

Step 1: Submit a job to Hadoop by specifying the location of the input and output files, the map and reduce functions, and the job configuration parameters.

Step 2: Hadoop submits the job to the JobTracker, which distributes the tasks to the slave nodes and deals with monitoring and scheduling.

Step 3: The TaskTrackers execute the map and reduce tasks and store the output of the reduce in the right places.

What Is MapReduce


People make MapReduce seem a lot more complicated than it actually is.

MapReduce is a programming model (often implemented with Hadoop in your choice of programming language) that is used to process huge amounts of data.


It has 2 main phases, and a few additional steps.

The first step is some sort of input splitting that decides which data each mapper is going to receive as input.

The next step is actually running the mapper function; a mapper essentially takes in an input and outputs some key-value pairs.

Third, we have the shuffle step, where all key-value pairs with the same keys are grouped together.

Fourth, each one of these different groups of key value pairs has Reduce() run on it, which will then output any number of key-value pairs.


A common real-world example of MapReduce is using it to count word frequencies. You break your input text into groups, then run Map on each group. This outputs each word as a key-value pair <word, 1>. All key-value pairs with the same word as their key are then grouped together, and Reduce is run on each group to output <word, count>.
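As a rough illustration (plain Python standing in for a real Hadoop job; the function names here are just for this sketch), the word count pipeline looks something like this:

from collections import defaultdict

def map_words(chunk):
    # Map: emit a <word, 1> pair for every word in the input chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word, counts):
    # Reduce: collapse each group into a single <word, count> pair.
    return (word, sum(counts))

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_words(chunk)]  # each map could run in parallel
grouped = shuffle(mapped)
results = [reduce_counts(word, counts) for word, counts in grouped.items()]  # each reduce could run in parallel
print(results)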

The major benefit is the ability to run all your maps in parallel, and all your reduces in parallel (since every reduce is run on a single key), which can save time, better utilize compute resources, and allow scalability.

Machine Learning Introduction @ Hack Gen Y


Machine learning: Startup.ml – brings machine learning to startups (Arshak).

Two open source projects: Vowpal Wabbit and Apache Accumulo.

Uses: text and image search, spam detection, speech recognition, fraud detection, intrusion detection in systems, activity recognition, autonomous driving, early epidemic detection. (This is how cell phones turn on to voice.)

Bad uses of machine learning: banner ads, recommender systems, credit scores, Google Glass facial recognition. Deep learning (a feeding system for machine learning); DeepFace (for Facebook tag recognition).

Machine learning: inputs → program (parameters, instances) → prediction.

Predictions: binary classification, various categorical divisions.

Regression problem: how interested am I in a particular sport? A supervised learning technique (a human taught it to the machine with example data sets).

Unsupervised: make 5 separate categories on your own; the algorithm decides the groupings itself.

Dimensionality reduction: preserve the information in all these columns of data, but compress it into very few columns so a human can understand it.

Inputs: continuous (income, age, time spent on page); categorical (state of residence, children, marital status); sparse (pages visited, a collection of pixels) – doesn't have too much info, only a few scattered pieces of data.

Chris@gervang.com (spark core related work)

Daniel Haaser – Daniel@makeschool.com