One of my favorite data sources is the NYC OpenData site. A while back I noticed a really interesting data set on that site, the 2015 Street Tree Census, and I’ve been mulling over what exactly to do with it ever since. I knew there was some interesting question that this data could answer; I just had to think of it (or ask my girlfriend to think of it for me).

In this blog post I hope to introduce you to the powerful and simple Metropolis-Hastings algorithm. This is a common algorithm for generating samples from a complicated distribution using Markov chain Monte Carlo, or MCMC.
By way of motivation, remember that Bayes’ theorem says that given a prior \(\pi(\theta)\) and a likelihood that depends on the data, \(f(\theta | x)\), we can calculate \[ \pi(\theta | x) = \frac{f(\theta | x) \pi(\theta)}{\int f(\theta | x) \pi(\theta) \; \mathrm{d}\theta}. \]
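
To make the idea concrete, here is a minimal sketch of a random-walk Metropolis sampler, which is the special case of Metropolis-Hastings with a symmetric proposal. It is not taken from the full post: the function name `metropolis_hastings`, the Gaussian proposal, the `step` parameter, and the example target `log_target` are all assumptions chosen for illustration. What it shows is that the sampler only ever uses a ratio of target densities, so the integral in the denominator of Bayes’ theorem never has to be computed.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log density.

    log_target: function returning the log of the target density,
                up to an additive constant (hypothetical interface).
    """
    rng = random.Random(seed)
    x = x0
    log_p = log_target(x)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal centered on the current state.
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p_new / p_old). Any normalizing
        # constant cancels in this ratio. 1 - random() lies in (0, 1],
        # so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        # On rejection the chain stays put and the old state is recorded again.
        samples.append(x)
    return samples


def log_target(theta):
    """Example target (an assumption): standard normal, unnormalized."""
    return -0.5 * theta * theta


samples = metropolis_hastings(log_target, x0=0.0, n_samples=20000)
```

In a Bayesian setting, `log_target` would be the log-likelihood plus the log-prior; the sample mean and variance of `samples` then approximate the corresponding posterior quantities, with accuracy depending on the step size and chain length.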

(This is a companion to my post on Paul Gronke’s earlyvoting.net)
One of the first assignments we had in my Election Sciences course was to take a look at registration data from the Oregon Motor Voter program and try to find interesting patterns. For those who don’t know, Oregon Motor Voter is an automatic voter registration program in Oregon. Whenever someone interacts with the Oregon DMV, their voter eligibility is automatically checked, and if they are eligible to vote but not registered, they are automatically added to the rolls.

In 1861 the small town of Hagelloch, Germany experienced a measles outbreak. A doctor very carefully recorded the time of infection and symptoms for each patient.1 In the 1990s, another German doctor went through all this data and was able to deduce the source of infection of each child.2 This data set gives us a wealth of information about the spread of disease, and it also offers the rare opportunity to view disease as spreading over a network.