Predictive Modelling and Data Mining

Published by Listed Notes on

Week 1: Introduction to Predictive Modelling

What’s Data: Definitions

  • Data: A value that is measured (continuous, e.g. 25, 100) or counted or categorized (discrete, e.g. male, married, 5). Data by itself has no meaning.
  • Information: Interpreted data, i.e. data with meaning added and the relations within it understood. E.g. if the measured value is 25 and the measuring device is a thermometer, then the reading is a temperature. The attribute temperature adds meaning to the data.
  • Knowledge: Rules, patterns and generalizations extracted from information. E.g. if humidity is high, the temperature is warm and the air pressure is above 100, then it will likely rain.
  • Wisdom: Understanding the principles embodied in the knowledge in order to make judgments or decisions. E.g. if you expect rain, take an umbrella with you.

What’s Data

  • Data Modeling: Creating a structure, organization, function or an abstract view of the data.
  • Data Analysis: Transforming or operating on data to extract useful information, knowledge or conclusions.

  • Data Mining: Carrying data analysis further to discover unforeseen or hidden patterns in the data.

Descriptive versus Inferential Analysis

  • We have data (samples). This data is a sample of a population, which includes more than just the measured or observed values.
  • Descriptive analysis covers the analyses and measures that describe and summarize the data in hand, i.e. the available samples. In general, we cannot use it to draw conclusions about unobserved data.
  • Inferential (predictive) analysis covers the analyses that describe measures over the whole population, i.e. both observed and unobserved data.

– Take, for example, calculating the mean of a sample. The result is exact for the sample, so it is descriptive of the sample. For inferential analysis, it is only an estimate of the population mean, which may differ slightly from the sample mean.
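The difference between the two views of the sample mean can be sketched in a few lines of Python. This is only an illustration: the population below is simulated, and the names population, sample and standard_error are made up for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated readings (not real data).
population = [random.gauss(25.0, 5.0) for _ in range(10_000)]

# In practice we only observe a small sample of the population.
sample = random.sample(population, 50)

# Descriptive: the mean of the sample is exact for the sample itself.
sample_mean = statistics.mean(sample)

# Inferential: the same number is only an estimate of the population mean.
# The standard error gives a rough idea of how far off the estimate may be.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Normally unknown; we can compute it here only because we simulated it.
population_mean = statistics.mean(population)

print(f"sample mean:     {sample_mean:.2f} (+/- {standard_error:.2f})")
print(f"population mean: {population_mean:.2f}")
```

The two means are usually close but not identical, which is why the sample mean is treated as an estimate when we talk about the population.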


What is Predictive modeling?

  • Predictive modeling is a process that uses data mining, statistics and probability to predict a set of variables (i.e. response variables) based on another set of variables (i.e. predictors).
  • Example: Predictive modelling is used in vehicle insurance to assign a risk of incidents to policyholders based on information obtained from them. In the general form of this data set, the usual response variable is the claim cost, and the first seven variables can be used as predictors.

  • Example: Washington University conducted a clinical study to determine whether biological measurements made from cerebrospinal fluid (CSF) can be used to diagnose or predict Alzheimer’s disease (Craig-Schapiro et al. 2011).

The usual response variable is predictor$Diag and the rest can be used as predictors.

  • Example: Prof. Yeh (1998) describes a collection of data sets from different sources that can be used for modeling the compressive strength of concrete formulations as a function of their ingredients and age.

  • Example (scheduling data): These data consist of information on 4331 jobs in a high-performance computing environment. Seven attributes were recorded for each job, along with a discrete class describing the execution time. The predictors are: Protocol (the type of computation), Compounds (the number of data points for each job), InputFields (the number of characteristics being estimated), Iterations (the maximum number of iterations for the computations), NumPending (the number of other jobs pending at the time of launch), Hour (decimal hour of day of the launch time) and Day (of the launch time). The classes are: VF (very fast), F (fast), M (moderate) and L (long).
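To make the idea of predicting a response from predictors concrete, here is a minimal sketch that fits a straight line by least squares with one predictor and one response. The numbers are made up for illustration and do not come from any of the data sets above; fit_line and predict are hypothetical helper names. The methods covered in the course are more general than this.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the common factor 1/n cancels).
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x_new, slope, intercept):
    """Predict the response for a new predictor value."""
    return intercept + slope * x_new


# Made-up training data: one predictor and one response.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(predictor, response)
prediction = predict(6.0, slope, intercept)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted response at 6.0: {prediction:.2f}")
```

The model is built from observed pairs and then applied to a predictor value that was not observed, which is the step that makes the analysis predictive.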

In the course:

  • In this course we will go through various predictive methods and algorithms. We will talk a little about the other steps mentioned in the previous slides, such as preparing data.
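  • In this course, we use Python. For many general-purpose tasks it can be faster than R, although relative speed depends on the workload and the libraries used.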
  • By the end of the course you should be able to build and deploy predictive models in practice and build on the understanding that you gained in BDA 101 of the underlying concepts in predictive modelling.

First part of assignment 1

  • In the first part of assignment 1, we need to install various packages. Let’s install them and look at their descriptions.
  • There are different ways to import Python packages into your environment. Note that the keywords import, from and as are written in lowercase:

1. import package_name

2. import package_name as some_name

3. from package_name import *

4. from package_name import specific_function
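As a quick illustration, the four forms can be tried with the standard-library math module, which needs no installation. The choice of math and of the alias m is just for this sketch; the assignment uses other packages.

```python
# 1. Import the whole package; names are accessed through the package name.
import math
print(math.sqrt(16))

# 2. Import the package under a shorter alias.
import math as m
print(m.floor(2.7))

# 3. Import every public name into the current namespace.
#    Handy in the console, but it can silently overwrite existing names.
from math import *
print(ceil(2.1))

# 4. Import only the specific function you need.
from math import sqrt
print(sqrt(9))
```

Forms 1, 2 and 4 make it clear where each name comes from, so they are usually easier to read in longer scripts than form 3.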

Second part of assignment 1

  • In the second part of assignment 1, we need to import the packages that we installed before and run some basic commands to get used to these packages.
  • For the remainder of this lesson, please work on assignment 1. Your solutions may be uploaded to Avenue to Learn for feedback. Note that today’s assignment does not have any weight in the final grading scheme.
  • However, Assignment 1 forms an important basis of knowledge for the remaining assignments that will be included in your final grade. We have found that students who did not complete assignment 1 were unable to complete future assignments in a timely manner.

Python warmup: Basics

  • Variables: To assign a variable in Python, use the following format:


                        name_of_variable = value_of_variable

                         Example:   a = 2

  • Variable types:

1. Numbers: Python supports three different numerical types:

   1. int (integer): 2, 123, -8

   2. float (decimal): 0.02, -1.43

   3. complex: 2-3j, 43+98j
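The built-in type function shows which numerical type a value has. A small sketch to try in the console; the variable names count, ratio and z are arbitrary.

```python
count = 2        # int
ratio = -1.43    # float
z = 2 - 3j       # complex

print(type(count))
print(type(ratio))
print(type(z))

# Mixing an int and a float in arithmetic gives a float.
total = count + ratio
print(type(total))

# The real and imaginary parts of a complex number are stored as floats.
print(z.real, z.imag)
```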

2. Strings: Strings in Python are contiguous sequences of characters enclosed in quotation marks. For example, "Hello" is a string. Subsets of a string can be taken using the slice operator ([ ] and [:]), with indexes starting at 0 at the beginning of the string; negative indexes count backwards from -1 at the end of the string.

  • Type text = "BDA is Awesome!" into your IDE console and select some subsets from the string. (Avoid naming the variable str, because that would hide Python's built-in str function.)

The plus (+) sign is the string concatenation operator and the asterisk (*) is the repetition operator. For example:

text = 'Hello World!'
print(text)           # Prints the complete string
print(text[0])        # Prints the first character of the string
print(text[2:5])      # Prints the 3rd to 5th characters
print(text[2:])       # Prints the string starting from the 3rd character
print(text * 2)       # Prints the string two times
print(text + "TEST")  # Prints the concatenated string
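Negative indexes, which count from the end of the string, work with the same slice operator. A short sketch using the same example string; the expected results in the comments follow from "Hello World!" having 12 characters.

```python
text = "Hello World!"

last_char = text[-1]      # '!'  (the last character)
last_word = text[-6:]     # 'World!'  (the last six characters)
all_but_last = text[:-1]  # 'Hello World'  (everything except the last character)

print(last_char)
print(last_word)
print(all_but_last)
print(len(text))          # 12, the number of characters in the string
```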

3. Lists: Lists are the most versatile of Python’s compound data types. A list contains items separated by commas and enclosed within square brackets ([]). To some extent, lists are similar to arrays in C or R. The values stored in a list can be accessed using the slice operator ([ ] and [:]), with indexes starting at 0 at the beginning of the list and running up to the list length minus 1 at the end.

The plus (+) sign is the list concatenation operator, and the asterisk (*) is the repetition operator. For example:

my_list = ['abcd', 786, 2.23, 'john', 70.2]   # named my_list so the built-in list is not hidden
tinylist = [123, 'john']
print(my_list)             # Prints the complete list
print(my_list[0])          # Prints the first element of the list
print(my_list[1:3])        # Prints the 2nd and 3rd elements
print(my_list[2:])         # Prints the elements starting from the 3rd element
print(tinylist * 2)        # Prints the list two times
print(my_list + tinylist)  # Prints the concatenated lists

Lists can contain data elements of different types, e.g. strings, numbers (integers and floats), other objects, etc.

➢ Create some lists in the console and apply some of the commands above

4. Tuples: A tuple is another sequence data type that is similar to the list. A tuple consists of a number of values separated by commas. Unlike lists, however, tuples are enclosed within parentheses.

The main difference between lists and tuples is that lists are enclosed in brackets ( [ ] ) and their elements and size can be changed, while tuples are enclosed in parentheses ( ( ) ) and cannot be updated. Tuples can be thought of as read-only lists. For example:

my_tuple = ('abcd', 786, 2.23, 'john', 70.2)   # named my_tuple so the built-in tuple is not hidden
tinytuple = (123, 'john')
print(my_tuple)              # Prints the complete tuple
print(my_tuple[0])           # Prints the first element of the tuple
print(my_tuple[1:3])         # Prints the 2nd and 3rd elements
print(my_tuple[2:])          # Prints the elements starting from the 3rd element
print(tinytuple * 2)         # Prints the tuple two times
print(my_tuple + tinytuple)  # Prints the concatenated tuple

my_tuple = ('abcd', 786, 2.23, 'john', 70.2)
my_list = ['abcd', 786, 2.23, 'john', 70.2]
my_list[2] = 1000    # Valid: lists are mutable
my_tuple[2] = 1000   # Raises TypeError: tuples do not support item assignment

  • You can use various built-in functions to convert values from one type to another.

int(x) : Converts x to an integer

float(x) : Converts x to a floating-point number.

str(x) : Converts object x to a string representation.
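A few conversions to try in the console. The values are arbitrary, and the variable names are made up for this sketch.

```python
as_int = int("42") + 1          # string -> int, then ordinary arithmetic
truncated = int(3.99)           # float -> int drops the fractional part
as_float = float("2.5") * 2     # string -> float
label = str(786) + " items"     # int -> str, so it can be concatenated

print(as_int, truncated, as_float, label)

# Conversion fails with a ValueError when the text is not a valid number.
try:
    int("abc")
except ValueError as error:
    print("cannot convert:", error)
```

Note that int applied to a float truncates rather than rounds; use the built-in round function when rounding is intended.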
