Statathon will be held jointly by the UConn Department of Statistics and the New England Statistical Society at the 35th NESS Symposium (May 22, 2022 – May 25, 2022). Statathon is a statistical data science invention marathon. Anyone with an interest in data science can attend Statathon to approach real-world data science problems, some of which are local, in new and innovative ways. It emphasizes the statistical aspects (insight, interpretation, significance, etc.) of data science problems that are often overlooked in many hackathons.

The deadline for Theme 2 has been extended until Wednesday noon
(no later than May 4, 2022 12:00 pm EDT)!


23 February, 2022,

Registration Open

Online registration opens, data sets released online with instructions.

16 March, 2022,

Registration Deadline for Individuals Looking for a Team

Individuals looking to join an assigned team should register by this date, and we will provide your team information no later than March 20.

25 March, 2022,

Team Registration Deadline

Teams or individual participants should register by this deadline; online registration will be closed at the end of the day.

1 May, 2022,

Submission Deadline

Deadline for teams to submit their work for the panel to review. Submissions close at 11:59 pm EDT.

4 May, 2022,


Finalist Notification

Finalist teams are selected and notified.

22 May, 2022,


Finalist Presentations

Finalist teams present to the review panel at the 35th NESS Symposium, virtually. The presentation session is scheduled for 5:30 pm – 7:30 pm EDT on May 22, 2022 (Sunday).

Date TBD


Award Ceremony

Awards will be presented to winning teams at the closing ceremony.

Themes and Data

Statathon 2022 focuses on the customer retention theme with continued support from Travelers. The related data sets can be downloaded below or from Kaggle. You are encouraged to use related auxiliary data from other sources if necessary.

Theme 1: Customer Retention

For this theme, there are true answers, and a team should focus on proposing the best predictive model. A team's performance will be evaluated mainly on the predictive accuracy of the proposed method and the quality of the code. You can use Python's sklearn.metrics.accuracy_score to calculate the accuracy score for your model.
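As a minimal illustration (the labels below are hypothetical, assuming outcomes such as renewed/canceled are encoded as integer classes), the accuracy score is simply the fraction of predictions that exactly match the true labels:

```python
from sklearn.metrics import accuracy_score

# Hypothetical class encoding for illustration: 0 = renewed, 1 = canceled, 2 = other
y_true = [0, 1, 2, 1, 0, 0]  # true policy outcomes
y_pred = [0, 1, 1, 1, 0, 2]  # model predictions

# 4 of 6 labels match, so the accuracy is 4/6
score = accuracy_score(y_true, y_pred)
```

For a multiclass model, a prediction only counts as correct when the predicted class equals the true class; there is no partial credit across classes.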

Challenge: Using historical policy data, create a multiclass predictive model to predict the policies that are most likely to be canceled and those most likely to be renewed, as well as understand what variables are most influential in causing a policy cancellation.

Training dataset: 4 years of property insurance policies from 2013 to 2017.

Test dataset: Test data for property insurance policies.

For more details about this theme, please register as a team or register to join a team for the Statathon, and we will send you a link to work on this challenge through Kaggle.

(Data sets are synthetic, provided by Travelers)

Theme 2: Inferring Customer Action Post-Alert

Problem: HSB has deployed monitoring sensors on a specific type of engineering system used by commercial insureds. These monitoring sensors send system health information at regular intervals back to HSB servers, where the data is analyzed in real time. HSB uses a proprietary anomaly detection algorithm on this data to detect poor operating conditions that could lead to damage and insurance claims. In the event an anomaly is detected in the data stream, an alert is sent to the insureds via SMS, email and/or a mobile application. In the ideal case, the insured acknowledges the alert, reviews the health of the engineering system and takes action to mitigate the poor operating condition. Unfortunately, for a variety of reasons, HSB never receives acknowledgment for some alerts. This does not necessarily mean, however, that the insured did not take action to mitigate poor conditions; only that they did not feel the need to respond. While HSB is working on methods to incentivize and improve insured response rates, we know we will never have a 100% response rate or a perfectly accurate one (such is human nature). As such, we would like to develop methods to infer whether an insured took action based on the sensor data alone to supplement our customer response data and to better understand the efficacy of our monitoring program.

Task: Your team will be given 25 example alerts and their associated monitoring sensor data time series (preceding and post-alert). Using whatever methodology you see fit, develop a “belief score” and a decision threshold as to whether an insured took action to mitigate the poor operating condition within 12 hours post alert or not. Some important notes are below:

  1. Insured actions would be reflected by an increasing trajectory post-alert and potentially a subsequent stabilization of the variable of interest (referred to as “y” in our data set) that measures system health. Continuing downward trajectories would be evidence that the poor operating condition is still unaddressed and is likely getting worse.
  2. The variable of interest y fluctuates as a function of several measured and unmeasured random variables and is controlled by an automated mechanism (control system) designed to ensure that the entire system is operating healthily and according to specifications. As a result, you may notice that y will display periodic or quasi-periodic patterns. This is important for this analysis because it means that simple rules such as “y increased post-alert” are insufficient for assigning belief scores or action decisions. You are looking for something unexpected given historical patterns and precedent in the data.
    • When this automated control system fails or is functioning inadequately, poor health and anomalous conditions in the system at large may result and generate an alert. The control system may recover, or it may not.
    • Context such as time of day or day of the week may be important in your analysis.
  3. Data from each sensor should be considered to be independent from all other sensors. Each insured’s system is somewhat unique and certainly not affected by the others. However, this does not mean that similar or common patterns will not be observed between sensors.
  4. There is no ground truth (insured response) in this data set.
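One naive way to turn note 1 into a number (a sketch only; the function name, window choices, and 0.5 threshold are all hypothetical, and note 2 warns that a slope rule alone is insufficient) is to standardize the post-alert trend of y against the slopes of same-length historical windows, then squash the result into a (0, 1) belief score:

```python
import numpy as np

def belief_score(y_post, y_hist, eps=1e-9):
    """Naive belief that the insured acted: standardize the post-alert linear
    trend of y against slopes of same-length historical windows, then squash
    the z-score into (0, 1) with a logistic function."""
    t = np.arange(len(y_post))
    post_slope = np.polyfit(t, y_post, 1)[0]

    # Slopes of rolling historical windows of the same length as the post window
    w = len(y_post)
    hist_slopes = [np.polyfit(t, y_hist[i:i + w], 1)[0]
                   for i in range(len(y_hist) - w + 1)]
    z = (post_slope - np.mean(hist_slopes)) / (np.std(hist_slopes) + eps)
    return 1.0 / (1.0 + np.exp(-z))

# Example: a recovering (rising) 12-hour post-alert window (48 readings at
# 15-minute intervals) against flat, noisy history
rng = np.random.default_rng(0)
y_hist = rng.normal(size=200)                            # stable historical y
y_post = np.linspace(0.0, 3.0, 48) + rng.normal(scale=0.2, size=48)
score = belief_score(y_post, y_hist)                     # high: rise is unusual
acted = score > 0.5                                      # hypothetical threshold
```

A real method would also need to handle the missing readings and the periodic or quasi-periodic patterns described in note 2, for example by comparing only against historical windows from the same time of day.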

Data: Your team has been provided with two data sets, one called “alerts.csv” and the other called “ts.csv”:

  • alerts.csv contains the following variables:
    • datetime: the time and date the alert occurred
    • time: the integer time index variable where the alert occurred in the related time series data. This value will always be the same (we always provide the same number of sample points pre and post alert in this challenge) and can be used to join the alert to the proper temporal context within the time series easily.
    • sensor: a unique ID for each sensor allowing you to join alert information to the time series data.
  • ts.csv contains the following variables:
    • y: the main variable of interest describing system health. y is measured every 15 minutes.
    • x: a potential exogenous predictor of y measured by the same sensor. The relationship between x and y may vary sensor to sensor and be time dynamic. y cannot cause changes in x, but x may cause changes in y.
    • time: a regular time index provided for your convenience. Note that some values of y or x may be missing for certain time indices. This is a result of data transmission failure and it is your responsibility to use methods that manage this missing data.
    • datetime: the time and date the sensor readings occurred.
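Assuming the column layout above, a pandas sketch of joining an alert to its sensor's time series and handling transmission gaps might look like the following (the small in-memory frames stand in for alerts.csv and ts.csv, and interpolation is just one possible missing-data strategy):

```python
import pandas as pd

# Tiny stand-ins for alerts.csv and ts.csv; in practice use pd.read_csv(...)
alerts = pd.DataFrame({
    "datetime": ["2022-01-05 03:15"],
    "time": [4],              # integer index of the alert within the series
    "sensor": ["s1"],
})
ts = pd.DataFrame({
    "sensor": ["s1"] * 8,
    "time": range(8),
    "y": [1.0, 0.9, None, 0.7, 0.5, 0.6, 0.8, 0.9],  # None = transmission failure
    "x": [0.2, 0.3, 0.1, 0.4, 0.2, None, 0.3, 0.2],
})

# One simple missing-data strategy: per-sensor linear interpolation in time order
ts["y"] = ts.groupby("sensor")["y"].transform(lambda s: s.interpolate())

# Join each alert to its sensor's time series, then split the pre/post windows
merged = ts.merge(
    alerts[["sensor", "time"]].rename(columns={"time": "alert_time"}),
    on="sensor",
)
post = merged[merged["time"] >= merged["alert_time"]]
```

Since y is measured every 15 minutes, the 12-hour post-alert window in the task corresponds to 48 readings after the alert's time index.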

(This theme and the data sets are sponsored by Hartford Steam Boiler.)



All teams should register online. If you already have a team or want to participate as an individual, please register using the following link.

Registration form for teams or individual participants.

Each team may have up to four team members, and only one registration form should be submitted by each team, listing the names of all team members.

If you do not have a team but want to be a part of one, please use the following form to register. The organizers will try to match you up with similar participants.

Registration form for individuals looking for a team.

Report submission

All teams should submit their work by the deadline (May 1, 2022, 11:59 pm EDT). Teams are encouraged to create a Git repository (e.g., Bitbucket, GitHub, or GitLab) to host their source code and data information. However, this is not a review factor in the competition.

Customer Retention Theme: Teams working on this theme should submit their work through Kaggle InClass using the link provided to you. No presentation slides are required for the first-round submission. Finalist teams are expected to create slides based on their work and give presentations in a session of the NESS conference.

Inferring Customer Action Post-Alert Theme: To complete round one, each team must submit a slide deck of no more than 15 pages describing their methods, summarizing results and briefly outlining next steps they will pursue should they pass to round two. Preferred formats are .pdf or .html. Teams must also submit code (preferably well commented) and may use any language they choose for their analysis. Teams that pass to round two may extend the presentation to 20 slides and must submit updated code. The presentation and code should all be packaged into a .zip file and submitted to TBD.

Team presentations

Ten teams (five from each theme) will be selected as finalists and invited to give a team presentation to the review panel in the afternoon or evening of May 22, 2022. Each team will have 20 minutes to present their findings and products.


Who can participate?

Students from universities and high schools can participate. We will not distinguish between high school, undergraduate, and graduate students among participants.

Do I have to pay to participate?

No. Participation is free for Statathon. We will select five finalist teams from each theme to present at the 35th NESS Symposium.

Will the presentation take place in person?

No. The presentation this year will be completely virtual. We are hoping that this will attract students from all over the world.

How big can a team be?

Each team can have up to 4 participants.

How can I form a team?

Participants can form teams among peer students with common interests and/or complementary expertise. If you are not able to find a team yourself, you may either work individually or request to be assigned to a team with other participants who do not have one. This is an opportunity for you to meet and work with new people. A participant can be a member of only one team.

When can I start working on the problem?

You can start your work on the problem now.

What programming language can I use?

You can use any programming language or software packages.

Will there be prizes?

Yes! There will be cash prizes for the 1st, 2nd, and 3rd place teams in both themes, ranging from $100 to $300.

Where can I find the data?

The customer retention theme uses Kaggle InClass for the Statathon. You can download the data directly from the link provided in the Themes and Data section of the Statathon website.

When do I need to finalize my team?

Teams must be finalized no later than March 25. If you are an individual looking to join an assigned team, you need to register by March 16, and we will provide that information to you no later than March 20.

Can a professor or another professional act as a team mentor?

Yes, a professor or another professional can act as a team mentor. However, this person is not a member of the team and cannot implement any work for the team.

What are the judging criteria for finalists?

Customer Retention Theme: Using the private Kaggle leaderboard, we will evaluate the teams whose models achieve the most accurate scores, compared to a gradient boosting machine benchmark. The code of the top teams on the leaderboard will be reviewed, and based on the model score and the code review, we will select 5 finalist teams. We are also looking for each team to provide a business recommendation based on the results of its model.

Inferring Customer Action Post-Alert Theme: This challenge is one of inference, not simply predicting a point estimate. As a result, you will not be judged on a measure of predictive accuracy as you may be accustomed to from Kaggle or other similar competitions. Not every challenge in statistics or data science can be solved with supervised learning and point predictions, and we hope this problem motivates you to be creative and to try something new. Your team will be judged with a score from 1 to 10 on the following criteria:

  1. Methodology: Your methodology should be appropriate for the problem, logical and statistically/mathematically rigorous. Extra points will be awarded for novelty and elegance of the solution.
  2. Efficiency: The methods you apply should be computationally efficient and should give consideration to operations and logistics in a real-world setting (easy to implement without overly complex or superfluous workflows). In practice, we may need to evaluate several thousand of these alerts in a given day.
  3. Presentation: This is perhaps the most important criterion. Your team must present your work clearly and concisely with a convincing narrative supported by data and graphics.

The five finalist teams (or more) will be invited to present their work at the symposium, and the winners will be selected among them.

Contact Info


Patrick Buckley, Travelers

Nathan Lally, Hartford Steam Boiler (Munich Re Group)

Aolan Li, University of Connecticut

Kelly Li, Travelers

Daeyoung Lim (Chair), University of Connecticut

Tuhin Sheikh, University of Connecticut

Haiying Wang, University of Connecticut

Meiruo Xiang, University of Connecticut

Haiwei Zhou, University of Connecticut


For any further questions, please send them to

Go to Statathon 2019


Copyright ©, Department of Statistics, University of Connecticut, All rights reserved