EXAMPLE OF A CULMINATING PROJECT
In this section
you will find an example of a Culminating Project put together by
two future teachers as they worked through the MDM4U course content
for the first time. The Culminating Project is a major component of
the MDM4U course. Students can find personal interest in the subject
they take up in their project and they can do a very good job if they
start the project early in the course and return to it at various
times. It is my recommendation that students identify an area of interest
early in the term and that they be required to have found appropriate
data within the first two weeks of classes. Additional data can be
identified as the course develops. When a major section of the course
is concluded students should be asked to reflect and apply what they
have learned to advance their project. They will find that some sections
are directly applicable to their project and area of interest, while
other sections will only provide an opportunity to explore concepts
that marginally apply to their area of interest. A variety of activities
have been assembled in the area we have called Projects by Sections.
These are very much works in progress, and some of the work did not
make the Culminating Project. They are examples of the best that
the students could do at the time; on further investigation, however,
new ideas were generated or new data were found that fitted better
into the Culminating Project. For your benefit, links are provided
in the Culminating Project to parts of the Projects by Sections that
did not make the Culminating Project. This final product is certainly
not perfect and one could argue with some of the statements and conclusions
made within the Culminating Project. Nevertheless we hope that you
will find this resource useful and that you will join the discussion
group of mathematics teachers of MDM4U. Best wishes with the course.
Eric Muller
CULMINATING PROJECT
Mathematics for Data Management
- MDM4U
By
Sherrie Dyck and Bruce Petrie
Why am I paying so much for my car insurance?
Section 1 - Searching for an area of interest and locating data
I started this project by exploring various areas of possible
interest: universities, employment, and vehicle costs, especially
the cost of car insurance. Although my initial areas of interest were
quite varied (for a diagram of my brainstorming, click here), I found
that I had to limit my search for information to something that was
more manageable. My search for data soon led me to concentrate on the
issue of car insurance.
The first question that I raised was whether car insurance rates
were affected by alcohol use. Although one company discounted the
cost for abstaining from alcohol, I did not notice any substantial
difference, so I dropped this avenue. Out of interest, I sought out
data on alcohol-related accidents and found the internet source
http://www.nh-dwi.com/caip-206.htm
from the Community Alcohol Information Program (CAIP), a private,
non-profit agency founded in 1977 to provide alcohol education, assessment,
and evaluation services to persons convicted of alcohol-related offenses
in New Hampshire. These statistics are compiled by the U.S. Dept.
of Transportation and the N.H. Department of Safety. Since I found
no data on Canada, I did not follow up any further on the issue of
alcohol and driving.
I decided to search for data on car insurance costs and found that
quotes could be obtained from
www.kanetix.com
which allowed me to compare costs for different insurance companies.
Once the questionnaire was filled out, the only variable that I changed
for each gender was age, so that I could get a fair comparison without
changing other variables. As I suspected, the insurance costs varied
with age and gender, so I decided to follow this up by looking for
Canadian age and gender data from ESTAT, available from Statistics Canada.
The data that was collected from ESTAT was accessed in the following
way:
1. Go to
http://estat.statcan.ca/ and choose ENGLISH.
2. Accept the terms of the preceding Licence Agreement.
3. From the Table of Contents choose Education, Data,
and Students.
4. Search for the following table numbers to locate the information
regarding the possible research topics.
110-0002
110-0029
I also searched for data which would give me information on accident
rates and age of the driver. I did find some data on the Fathom CD
itself which I obtained as follows:
1. Choose Open from the File menu.
2. Open the Sample Documents folder, then the Learning Guide
Starters,
and then Accidents.
For information on what the document should look like, click here.
Although this was a start, I did not have any information on Ontario.
It was only much later in the project that I located data that I
would have liked to have had from the beginning! On the Ontario Ministry
of Transportation (MTO) web site
http://www.mto.gov.on.ca/english/
I did a search on "driver licenses by age", from which I found a
number of Ontario Road Safety Reports. I selected one of the more
recent ones, for the year 2000, under the heading "Ontario Road Safety
Annual Report 2000 - PDF", with web address
http://www.mto.gov.on.ca:80/english/safety/orsar/orsar00/ors_00.pdf
With these various sets of data I was ready to start narrowing down
my questions and exploring what the data I had found could point to.
Section 2 - Applying various mathematical techniques in the analysis and exploration of the data
Section 2a - Analyzing data involving one variable
Problem
There were serious limitations concerning the Fathom data itself.
I preferred the Ministry of Transportation (MTO) data because it was
Canadian and compared accidents with age and with the age distribution
of drivers. Although I did quite a bit of analysis with the Fathom
data, it was only after I located the MTO data website
http://www.mto.gov.on.ca/english,
and its Ontario Road Safety Annual Report 2000 that I made substantial
progress.
With this data, I will explore the number of drivers involved in
collisions and the number of drivers killed, putting both into
perspective against the number of drivers licensed in each age group.
This should allow me to draw useful conclusions concerning the
distribution of insurance rates.
Plan
To gather data from the MTO website, access the internet address
http://www.mto.gov.on.ca/english/ and press "search".
Search the MTO for the "Ontario Road Safety Annual Report 2000".
Access or download the full report, not chapters, in .pdf format.
The .pdf file can be found at the internet address
http://www.mto.gov.on.ca/english/safety/orsar/orsar00/ors_00.pdf
The data to be used is found in Table 2.2, "Category of Person
Killed by Age Groups 2000", and Table 2.20, "Driver Age Groups
- Number Licensed, Collision Involvement and Per Cent Involved in
Collisions 2000". I think the limitations of the data are insignificant.
For example, unlicensed drivers are counted in "Drivers Involved in
Collisions" but are not added to the "Drivers Licensed" figure
used to calculate the "% of Drivers of Each Age Involved in Collisions".
Data
The data obtained from the MTO, as entered into Fathom Collection
Charts, is as follows:
I have revised the charts to include
the age group 16-24 and the data has been restricted to include only
the data I want to explore.
In Fathom the following dot plots were made
to graphically represent the data in table 2.2 and table 2.20:
Graph 1: Population of Licensed Drivers by
Age Group
Graph 2: Population of Drivers
Involved in Collisions by Age Group
Graph 3: The percentage of Drivers
Involved in Collisions by Age Group
Graph 4: The Number of Drivers Killed
in an Accident by Age Group
Graph 5: The Percentage of Drivers Killed
(considering Drivers Licensed) in an Accident by Age Group
Analysis
Graph 1 shows a bell-shaped curve of licensed drivers by age group,
with the most licensed drivers in the 35-44 year old range.
Graph 2 shows that the age group involved in the most collisions is the
35-44 year olds. Graph 3 shows values that decline with age: when the
number of licensed drivers is taken into consideration, the highest
percentage of drivers involved in collisions belongs to the 16-24 year
olds. This reveals that even though more 35-44 year olds than 16-24
year olds are involved in accidents, the difference can be explained
by the greater number of licensed 35-44 year old drivers. Graph 4 shows
a decline in deaths as age increases, but when Graph 5 is taken into
consideration there is not really a difference between age groups in
the percentage of licensed drivers killed.
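The comparison between raw counts and percentages can be sketched in a few lines of Python (the project itself used Fathom). The counts below are hypothetical placeholders chosen only to show the effect; they are not the MTO figures, and the function name is an assumption made for this illustration.

```python
# Hypothetical counts for illustration only; these are NOT the MTO figures.
licensed = {"16-24": 1_000_000, "35-44": 2_000_000}
collisions = {"16-24": 70_000, "35-44": 100_000}


def percent_involved(collisions, licensed):
    """Return collisions as a percentage of licensed drivers, per age group."""
    return {group: 100 * collisions[group] / licensed[group] for group in licensed}


rates = percent_involved(collisions, licensed)
for group, rate in rates.items():
    print(f"{group}: {collisions[group]:,} collisions, {rate:.1f}% of licensed drivers")
```

With these made-up numbers the older group has more collisions in total, yet the younger group has the higher percentage, which is the same pattern described for Graphs 2 and 3.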
Conclusion
The data represented in graph 3 reveals that
the percentage of drivers involved in a collision decreases as age
increases. However, our data does not take into account the amount
of driving done by each age group or how often its members are on the
road. This data could support a decrease in insurance rates as age
increases, so in the next section I will explore insurance costs and
further explore the data in Graph 3.
Section 2b - Analyzing data involving two variables
Problem
Now that I have seen that the percentage of drivers involved in a
collision decreases as age increases, tests should be done to see
if there is a linear relationship. That is, what is the equation of
the line of best fit? Then a correlation coefficient needs to be determined
to see how well the line fits. I will also compare what I find to
the data I retrieved from Kanetix concerning insurance costs.
Plan
To determine the line of best fit, y = a + bx, x and y must first be
defined. Then the sums of squares needed for the slope b and the
intercept a must be calculated. Using the values calculated to determine
the equation of the line of best fit, a correlation coefficient will
be determined.
Data was obtained from Kanetix by entering standard information that
remained constant. A different age was entered for each set of data.
Ages I entered were between 18 and 32 with 2-year intervals (i.e.
18, 20, 22, 24, ..., 32). This was done for both males and females.
The data obtained from Kanetix was entered into a collection chart
in Fathom. Fathom was also used to create a scatter plot to see the
relationship between the age of a driver and cost of insurance.
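The sum-of-squares calculation described in this plan can be sketched in Python as follows. This is only an illustration of the standard least-squares formulas, not the Fathom procedure used in the project. The function name and the sample values are assumptions, and the sample values are hypothetical rather than the MTO data.

```python
from math import sqrt


def line_of_best_fit(xs, ys):
    """Return (a, b, r) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    r = sxy / sqrt(sxx * syy)    # correlation coefficient
    return a, b, r


# Hypothetical age-group midpoints and collision percentages (NOT the MTO data).
ages = [20, 30, 40, 50, 60]
percent = [7.0, 6.1, 5.4, 4.4, 3.7]

a, b, r = line_of_best_fit(ages, percent)
print(f"percentage = {a:.3f} + ({b:.4f}) * age, r = {r:.3f}")
```

A negative slope b together with r close to -1 would indicate a strong negative linear relationship.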
Data
Let x be the midpoint of each of the age groups. Let y be the percentage
of drivers involved in a collision.
Analysis
The following calculations were made with Fathom:
Therefore the equation of the line of best fit is:
% of drivers involved in collisions = 8.628 - 0.0826 x.
Analysis
To see how well the line fits, a correlation coefficient, r, needs
to be calculated. The correlation coefficient was calculated with
Fathom.
The closer |r| is to 1, the stronger the correlation. Since
-1 ≤ r ≤ 1 and the correlation coefficient for the line
of best fit is -0.984, there exists a strong negative correlation.
Therefore, y = 8.628 - 0.0826 x fits the data very well. The following
is a graph showing the original scatter plot including the Line of
Best Fit.
Although not required for the course, I wanted to try a similar procedure
with the Kanetix data as I did with the MTO data. I experimented with an
equation to find the curve of best fit. However, due to the nature
of the curve, at some exponents the curve could not be calculated
to the left of the vertex. For observational purposes, the centre was
set to be the last age group and the data was entered so as to create
a reflection along the vertical line marked by the last actual age
group. The formula for the curve is shown below each diagram.
The Kanetix data plots show that a non-linear relationship
exists between insurance costs and age. From this limited data,
insurance rates drop as age increases regardless of gender.
Conclusion
There exists a linear relationship between the percentage of drivers
involved in an accident and age. The equation of the line of best
fit is percentage = 8.628 - 0.0826 × age. The correlation coefficient
is -0.984. In the Kanetix data, I can see a non-linear decrease
in the insurance rates, as opposed to the linear decrease in the
collisions.
Section 2c - Probability distributions
Problem
I want to explore, with Fathom, the probabilities and their respective
distributions from the data obtained from the MTO. I want to look
at the probability distributions of male and female drivers, the collisions
of male and female drivers, and drivers killed in collisions, all
with respect to their ages. I would like to see whether I can find
any differences in these.
Plan
Using Fathom, I will create a collection chart, entering the data from
Tables 2.2 and 2.20 of the MTO data. I will use Fathom to calculate
relative frequencies for male and female drivers as well as the relative
frequency of drivers killed. The relative frequencies for male and
female drivers will take the total population into consideration.
I will then create graphs to look at the probabilities (we will use
relative frequency as the best estimate of the probability) calculated
in the collection chart. That is, the chart is a collection of data
and the resulting probabilities.
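As a rough sketch of the relative frequency calculation (written in Python rather than Fathom), each count is divided by the total of all counts. The counts below are hypothetical placeholders, not values from Table 2.2 or 2.20, and the function name is an assumption made for this illustration.

```python
def relative_frequencies(counts):
    """Divide each group's count by the total to estimate its probability."""
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Hypothetical counts per age group (NOT the MTO figures).
drivers_killed = {
    "16-24": 30,
    "25-34": 40,
    "35-44": 50,
    "45-54": 40,
    "55-64": 25,
    "65+": 15,
}

frequencies = relative_frequencies(drivers_killed)
for group, frequency in frequencies.items():
    print(f"{group}: {frequency:.3f}")
```

The relative frequencies always sum to 1, which is what allows them to be read as an estimated probability distribution over the age groups.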
Data
Graph 1: Relative Frequency of Drivers
Killed by Age Group
Graph 2: Relative Frequency of Female
Drivers by Age Group
Graph 3: Relative Frequency of Male
Drivers by Age Group
Graph 4: Percentage of Male Collisions
by Age Group
Graph 5: Percentage of Female Collisions
by Age Group
Graph 6: Percentage of Male and Female
Collisions by Age Group
Analysis
Graph 1 shows a bell-shaped curve, showing that 35-44 year olds are
killed more often in collisions than any other age group. Graphs 2 and
3 show bell-shaped curves demonstrating a similar pattern in the
driving-age population for both males and females, with most drivers
on the road being between 35 and 44. Graphs 4, 5, and 6 have a similar
pattern concerning the percentage of collisions by age group. The
highest probability of being in an accident, whether you are male or
female, occurs when you are 18 years old.
Conclusion
Graphs 2 and 3 show us that the probability of finding a 35-44 year old
driver is the highest amongst all driver age groups for both males and
females. That is, male and female 35-44 year olds make up the largest
group of drivers, and thus we have a greater probability of finding
them on the road. Since they have the highest population on the road,
they also make up the largest share of drivers killed while driving.
This is illustrated in Graph 1. Looking at Graph 4 we see that, among
males, 18 year olds have the highest probability of being in a
collision. That is, the probability of being in a collision if you
are an 18 year old male is 11.4%. From Graph 5, we see that, among
females, 18 year olds have the highest probability of being in a
collision. That is, the probability of being in a collision if you
are an 18 year old female is 6.9%. Looking at Graph 6 we see
that 18 year olds have the highest percentage of their population
involved in a collision. That is, the probability of being in a collision
if you are 18 years old is 9.3%. Almost 1 out of 10 18 year olds is
involved in a collision. This is important to note concerning the
insurance prices for 18 year olds.
Section 2d - Project on simulation
Exploration
In this case I will be using simulation to explore a situation rather
than trying to answer a particular question. I will be simulating
the gender of the driver in an OPP random check of 100 vehicles (not
including large trucks etc., which require a special driver's license)
on the 401 for a given day of the year. I aim to explore how the composition
of these samples can vary when the procedure is repeated. I will look
at the proportion of male drivers in each sample and the mean and
standard deviation of the data accumulated when the experiment is
repeated.
Plan
As an estimate of the probability of the driver of the car being male
or female I used the data provided by the Ontario Ministry of Transportation
in its 2000 Ontario Road Safety Annual Report, Table 2.16, which provides
Sex of Driver Population by Age Groups 2000. As part of the simulation
I will assume that only Ontario drivers will be stopped.
Analysis
From Table 2.16 of the 2000 Ontario Road Safety Annual Report
I find that there are 4,313,694 male drivers out of a total of 8,121,374
licensed Ontario drivers. Thus the probability that a random licensed
Ontario driver is male is 0.531 (to three decimal places). Since this
is a two-outcome experiment, and if I assume that drivers are statistically
independent, the experiment suggests a Binomial Probability model.
So, with a sample of 100 vehicles, the mean number of males driving
the cars is given by µ = np, or µ = 53.1 males, with a standard
deviation of σ = sqrt(npq) = sqrt(100 × 0.531 × 0.469) ≈ 4.99,
where q = 1 - p = 0.469.
However, this does not provide me with an indication
of what the result could be in each sample. To get this view I tried
a simulation, following the instructions given in a
Fathom Workshop Guide (reference).
The results of the simulation follow:
I introduced a slider for the probability of
stopping a male driver and set it as close to 0.531 as I could.
Through the simulation I generated a table of the
sample data,
from which I generated a Bar Chart of
the gender distribution for 100 drivers. A typical distribution was plotted.
I then repeated the simulation 200 times,
in each case noting the proportion of male drivers in the sample.
These 200 data were then plotted in a bar chart,
and the mean and standard deviation
of this distribution were calculated.
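The project ran this simulation in Fathom. As a rough equivalent, the same experiment can be sketched in Python under the same assumptions (independent drivers, probability 0.531 of a male driver, 100 vehicles per check, 200 repetitions). The names and the fixed seed are choices made for this illustration only, so the printed values will not match the Fathom run exactly.

```python
import random
from statistics import mean, stdev

P_MALE = 0.531       # probability of a male driver, from Table 2.16
SAMPLE_SIZE = 100    # vehicles stopped in one random check
REPETITIONS = 200    # number of simulated checks


def sample_proportion(rng, p=P_MALE, n=SAMPLE_SIZE):
    """Simulate one check of n vehicles and return the proportion of male drivers."""
    males = sum(1 for _ in range(n) if rng.random() < p)
    return males / n


rng = random.Random(2000)  # arbitrary fixed seed so the run is repeatable
proportions = [sample_proportion(rng) for _ in range(REPETITIONS)]

print(f"mean proportion of males: {mean(proportions):.3f}")
print(f"standard deviation:       {stdev(proportions):.3f}")
```

Changing the seed or the number of repetitions shows how the mean and standard deviation move from run to run.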
Observations
Through the simulation I saw that the sample composition of male and
female drivers could change quite a bit from sample to sample, not
only in terms of totals but also in the order in which they appeared.
When this process was repeated a large number of times, the mean proportion
of all the samples was close to the one on which I based my simulation,
and although this number changed slightly as the number of repetitions
was increased, it stayed consistently close to 0.531, which I noticed
is µ/n (where µ is the mean of the Binomial distribution).
The behaviour of the standard deviation was a bit more erratic than
that of the mean, but it did move around the value of 0.05. I explored
to see whether this was related to any of the values that I used in
the simulation. I found that it is close to sqrt(0.531 × 0.469 / 100),
which is about 0.0499, and is therefore also close to σ/n (where σ is the
standard deviation of the Binomial distribution).
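These relationships can be checked numerically. The short Python sketch below uses the two driver counts quoted from Table 2.16 and the standard Binomial formulas for the mean and standard deviation. The variable names are assumptions made for this illustration.

```python
from math import sqrt

male_drivers = 4_313_694   # male licensed drivers, Table 2.16
all_drivers = 8_121_374    # all licensed Ontario drivers, Table 2.16
n = 100                    # vehicles in one random check

p = male_drivers / all_drivers     # probability that a driver is male
mu = n * p                         # Binomial mean (number of males)
sigma = sqrt(n * p * (1 - p))      # Binomial standard deviation (number of males)

# Dividing by n converts counts of males into proportions of males.
print(f"p       = {p:.3f}")
print(f"mu      = {mu:.1f}")
print(f"sigma   = {sigma:.2f}")
print(f"mu/n    = {mu / n:.3f}")
print(f"sigma/n = {sigma / n:.4f}")
```

Dividing both µ and σ by n rescales the number of males in a sample to the proportion of males, which is the quantity recorded in the simulation.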
Section 2e - Project on the Normal Distribution
Exploration
The Fathom Workshop Guide that I used in my exploration of simulation
suggests that the distribution of the mean proportion of males in
my samples follows a Normal distribution.
Plan and Data
What I plan to do is to use the data generated from the simulation
and to analyse the distribution through the tools supplied by Fathom.
Analysis
If the simulation data follows
a Normal distribution, I would expect the cumulative distribution
to look like an S-shaped curve, very similar to the one I obtained.
I can get better visual confirmation that the data follows a Normal
distribution by looking at the Normal Quantile Plot provided by
Fathom.
From the Fathom Help file: a Normal
Quantile Plot shows the distribution of continuous (numeric) data.
It plots the z-scores associated with the percentile of each case
if the data were normally distributed. Therefore, if the data are
Normal, the plot should show a straight line. My simulation data
is very close to the straight line shown on the plot.
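To illustrate the idea behind such a plot, the Python sketch below pairs each sorted data value with the z-score of its percentile. The percentile formula (i + 0.5)/n is one common convention and is an assumption here; the formula Fathom actually uses may differ. The function name and the sample values are also hypothetical.

```python
from statistics import NormalDist


def normal_quantile_points(data):
    """Pair each sorted value with the z-score of its approximate percentile."""
    ordered = sorted(data)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # (i + 0.5) / n is an assumed plotting position; Fathom's may differ.
    return [
        (standard_normal.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(ordered)
    ]


# Hypothetical sample proportions, for illustration only.
points = normal_quantile_points([0.55, 0.48, 0.53, 0.51, 0.58])
for z, value in points:
    print(f"z = {z:+.2f}  value = {value:.2f}")
```

If the (z, value) pairs fall close to a straight line, the data are consistent with a Normal distribution.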
Finally, I can check whether the
distribution of my simulation data
has the following properties of the Normal Probability distribution:
50% of the data falls on each side of the
mean
About 68% of the data falls within one standard
deviation of the mean
About 95% of the data falls within two standard
deviations of the mean
About 99.7% of the data falls within three
standard deviations of the mean
For my simulation data the mean proportion is 0.531, the standard
deviation is 0.0497 and the Dot Plot is
where each dot represents two data.
I counted the data in each interval and divided by 200 to
get the percentage within each interval. I found the percentage
of data that falls on either side of the mean is 49.5% and 50.5%,
which is about 50% on each side of the mean. The percentage of data
that falls within one standard deviation (0.48 - 0.58) of the mean
is 73.5%, which is larger than the predicted 68%. The percentage
of the data that falls within two standard deviations (0.43 - 0.63)
of the mean is 97%, which is larger than 95%. The percentage of
the data that falls within three standard deviations (0.38 - 0.68)
of the mean is 100%.
I did not have time to explore whether the approximations
that I made (rounding off the mean and standard deviation, and counting
from the Dot Plot rather than using the raw data) biased these results.
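One way to avoid those approximations would be to count directly from the raw data. The Python sketch below shows the idea on a small hypothetical data set. The function name and the values are assumptions made for this illustration and are not the simulation results.

```python
from statistics import mean, stdev


def share_within(data, k):
    """Return the fraction of data within k standard deviations of the mean."""
    centre = mean(data)
    spread = stdev(data)
    inside = sum(1 for x in data if abs(x - centre) <= k * spread)
    return inside / len(data)


# Hypothetical sample proportions, for illustration only.
sample = [0.45, 0.50, 0.53, 0.55, 0.60]

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(sample, k):.1%}")
```

Applied to the 200 simulated proportions, the same function would give the percentages to compare with 68%, 95% and 99.7% without rounding the mean or the standard deviation first.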
Conclusion
The work that I did suggests that the distribution
of the data I obtained from the simulation could be modelled by
the Normal probability distribution.
Project Conclusions
So why am I paying so much for car insurance, and
what have I learned? While searching for an area of interest and
locating data I discovered that the limitations and validity of data
can seriously weaken or strengthen our project, and that our focus and
direction can change at any time. I also discovered that the format
may change and that restructuring may be needed to strengthen the
project. A lot of research and work went into this project that
didn't make the final Culminating Project. This would seem to reflect
a lot of statements concerning the importance of behind-the-scenes
work. Since our exploration of accident rates and age showed that there
is a linear reduction in our chances of being in a collision as
we get older, understanding why we, being young drivers, pay more
for insurance is fairly simple: we have a greater chance of being
in an accident. Although insurance rates are extremely high for
young adults, they do decrease non-linearly and are
much more reasonable by our mid-twenties.
With our exploration into probability we can say
that there is a greater number of licensed 35-44 year olds and
thus we have a greater probability of finding them on the road.
Since they are the largest population of licensed drivers, they are
killed more often in collisions and are in the most collisions.
However, 18 year olds have the highest percentage of their
age population involved in an accident.
While investigating simulation we saw that the sample
composition of male and female drivers could change quite a bit
from sample to sample, not only in terms of totals but also in the
order in which they appeared. We compared our simulation data to
a Normal distribution by comparing their properties, namely that 50%
of the data lies on each side of the mean, and the amount
of data within 1, 2, and 3 standard deviations of the mean.
I have learned a lot while doing this project,
not only about car insurance and accident rates, but also about the
amount of work and organization required to do research. It is disappointing,
however, to report that insurance rates are so high for young adults
because of something they can't change: their age. So consider high
insurance rates as a cost associated with a youth so many would
like to have back.