In New York City, a letter grade in a restaurant window is more than just a piece of paper: it's a major driver of foot traffic, revenue, and public perception. But what actually determines whether a kitchen earns an "A" or a "C"?
For our Big Data Analytics (CIS 5450) Final Project, we dove into the NYC Department of Health (DOHMH) restaurant inspection dataset to build a machine learning pipeline that predicts inspection grades and identifies the high-risk factors that lead to failure.
Special thanks to my teammates, Mudit and Samen, for their collaboration on this project.
Data Cleaning & Feature Engineering
NYC conducts hundreds of thousands of restaurant inspections each year, resulting in a large and complex dataset that is not immediately ready for modeling. We began with this raw data and applied several key data-cleaning and preprocessing steps to make it suitable for analysis. First, we standardized text fields using the unidecode library to resolve encoding inconsistencies, such as converting accented characters (e.g., “CAFÉ” to “CAFE”) for uniformity. Next, we addressed the high cardinality of cuisine types: the original dataset included over 100 distinct cuisines, so we designed a mapping strategy to consolidate them into 11 broader categories, including “Asian,” “Latin,” “Mediterranean,” and “American,” reducing noise and improving model generalization. Finally, we filtered the dataset to include only valid A, B, and C inspection grades from 2010 onward to ensure the analysis reflected modern health standards. The dataset used in this project can be found here.
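A minimal sketch of these steps, assuming the raw data is loaded into a pandas DataFrame `df` using the public dataset's column names, and with a cuisine mapping that is an illustrative subset of our 11 categories:

```python
import pandas as pd
from unidecode import unidecode

# Standardize text fields: strip accents and odd encodings, e.g. "CAFÉ" -> "CAFE"
df["DBA"] = df["DBA"].astype(str).map(unidecode).str.upper().str.strip()

# Collapse 100+ cuisine types into broader buckets (illustrative subset)
cuisine_map = {
    "Chinese": "Asian", "Japanese": "Asian", "Thai": "Asian",
    "Mexican": "Latin", "Peruvian": "Latin",
    "Greek": "Mediterranean", "Middle Eastern": "Mediterranean",
    "American": "American", "Hamburgers": "American",
}
df["cuisine_group"] = df["CUISINE DESCRIPTION"].map(cuisine_map).fillna("Other")

# Keep only valid A/B/C grades from 2010 onward
df["INSPECTION DATE"] = pd.to_datetime(df["INSPECTION DATE"], errors="coerce")
df = df[df["GRADE"].isin(["A", "B", "C"]) & (df["INSPECTION DATE"].dt.year >= 2010)]
```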
The Class Imbalance Problem
We identified a major hurdle: Class Imbalance. "A" grades make up the vast majority of the data. A naive model could achieve 80% accuracy simply by predicting "A" for every restaurant. This realization shifted our focus from Accuracy to Macro-F1 Score, which weights every class equally, so performance on the minority classes (B and C) counts as much as performance on the dominant A class.
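To see why accuracy is misleading here, consider a toy example with a naive "always A" predictor (the label counts below are illustrative, not our exact class proportions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy label distribution mirroring the imbalance: mostly A's
y_true = np.array(["A"] * 80 + ["B"] * 12 + ["C"] * 8)
y_pred = np.array(["A"] * 100)  # naive model that always predicts "A"

print(accuracy_score(y_true, y_pred))             # 0.80 -- looks strong
print(f1_score(y_true, y_pred, average="macro"))  # ~0.30 -- exposes the failure
```

Macro-F1 averages the per-class F1 scores, so the zero scores on B and C drag the metric down even though overall accuracy is high.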
Modeling Strategy
We approached the problem using a hierarchy of models, moving from simple linear benchmarks to complex neural networks.
Logistic Regression & PCA
We started with Multinomial Logistic Regression. To handle high-dimensionality and collinearity, we applied Principal Component Analysis (PCA), retaining 95% of the original variance. To handle the class imbalance, we tested both class-weighted and oversampled versions of the model. Both methods gave us a similar baseline Macro-F1 score of 0.61.
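A sketch of the class-weighted variant, assuming a preprocessed numeric feature matrix `X_train` and grade labels `y_train` (hyperparameters are illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# PCA(n_components=0.95) keeps just enough components to explain 95% of the
# variance; class_weight="balanced" reweights the loss inversely to class frequency
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_train, y_train)
```

The oversampled variant feeds the same pipeline data resampled with a tool such as imbalanced-learn's RandomOverSampler instead of using class weights.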
Ensemble Models (Random Forest & XGBoost)
To capture non-linear relationships and higher-order feature interactions, we employed Random Forest and XGBoost models. Unlike logistic regression, these tree-based approaches can learn complex decision boundaries without explicit feature engineering and are well suited for datasets with mixed numerical and encoded categorical features.
These models also handle imbalanced datasets better. Random Forest addresses class imbalance primarily through class weighting and aggregation across many trees, which reduces sensitivity to skewed class distributions. XGBoost's boosting procedure iteratively focuses on hard, misclassified examples, and it also allows explicit control via per-class or per-sample weighting, helping improve minority-class performance.
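A sketch of both setups, with placeholder hyperparameters rather than our tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Random Forest: built-in per-class reweighting, averaged across many trees
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", n_jobs=-1)
rf.fit(X_train, y_train)

# XGBoost (multiclass): no single imbalance flag, so upweight B/C samples directly
y_enc = LabelEncoder().fit_transform(y_train)       # XGBoost expects 0/1/2 labels
weights = compute_sample_weight("balanced", y_enc)  # inverse-frequency weights
xgb = XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
xgb.fit(X_train, y_enc, sample_weight=weights)
```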
Hyperparameter Tuning
We tuned the hyperparameters of both models using Randomized Search and Bayesian Optimization to evaluate which approach produced the best performance.
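As a sketch, randomized search ships with scikit-learn, while Bayesian optimization is available through libraries such as scikit-optimize; the search space below is illustrative, and `xgb` is the classifier from the previous sketch:

```python
from sklearn.model_selection import RandomizedSearchCV
from skopt import BayesSearchCV  # scikit-optimize; one of several Bayesian options

# Randomized search samples configurations uniformly at random
rand = RandomizedSearchCV(
    xgb,
    {"n_estimators": range(100, 600), "max_depth": range(3, 10)},
    n_iter=25, scoring="f1_macro", cv=3,
)
rand.fit(X_train, y_enc)

# Bayesian optimization fits a surrogate model to past trials to pick the next trial
bayes = BayesSearchCV(
    xgb,
    {"n_estimators": (100, 600), "max_depth": (3, 10),
     "learning_rate": (0.01, 0.3, "log-uniform")},
    n_iter=25, scoring="f1_macro", cv=3,
)
bayes.fit(X_train, y_enc)
```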
The tuned models performed similarly, but the best result came from XGBoost tuned with Bayesian Optimization: a 6% macro-F1 improvement over the untuned model and a 24% macro-F1 improvement over the baseline Logistic Regression.
Neural Network
We implemented a feedforward neural network using TensorFlow and Keras to test whether deep learning’s ability to model complex, non-linear interactions could outperform our tree-based models. To handle the severe data imbalance, we utilized a Weighted Neural Network approach, which adjusted the loss function to penalize misclassifications of 'B' and 'C' grades more heavily. This allowed the model to look beyond simple linear relationships and better identify the subtle combinations of borough, cuisine, and violation types that signal a high-risk establishment.
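A minimal sketch of that setup, with an illustrative architecture rather than our exact configuration (`y_enc` holds integer-encoded grades as before):

```python
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # one output per grade A/B/C
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# class_weight scales the loss so mistakes on the rare B and C grades cost more
cw = compute_class_weight("balanced", classes=np.unique(y_enc), y=y_enc)
model.fit(X_train, y_enc, epochs=20, batch_size=256,
          class_weight=dict(enumerate(cw)), validation_split=0.1)
```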
The Neural Network achieved a macro-F1 score of 0.5516, underperforming both Random Forest and XGBoost. This could be because the ensemble models are more effective at capturing the discrete, rule-based decision boundaries inherent in the NYC health code.
Time Series
We incorporated Time Series analysis to move beyond static snapshots and understand the evolving nature of food safety in New York. Time series data captures gradual hygiene changes, as restaurants may improve or decline over time due to management changes, policy shifts, or seasonal effects.
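A sketch of the aggregation behind the grade-trend graph, reusing the cleaned DataFrame from earlier:

```python
import matplotlib.pyplot as plt

# Share of each grade per inspection year
trend = (df.groupby([df["INSPECTION DATE"].dt.year, "GRADE"])
           .size()
           .unstack(fill_value=0))
trend = trend.div(trend.sum(axis=1), axis=0)  # counts -> yearly proportions

trend.plot(xlabel="Year", ylabel="Share of inspections")
plt.show()
```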
From the graph we can see that there is a huge decrease in As and an increase in Cs during 2012 and 2013. This was likely driven by a combination of Superstorm Sandy’s impact on restaurant infrastructure and a period of peak enforcement stringency where minor "administrative" violations often pushed establishments into lower grade brackets. This trend began to disappear after 2014 following major city reforms that reduced fines and introduced "cure periods" for non-food-safety infractions, allowing restaurants to maintain better grades while focusing on critical hygiene. Ultimately, the recovery seen in our time series highlights how NYC's inspection outcomes are a reflection of both environmental shocks and shifting city policies.
Feature Importance
Next, we wanted to identify which features were the top drivers of inspection outcomes, since this would provide actionable insights for restaurant owners. We performed feature-importance analysis on the ensemble models because they achieved the highest macro-F1 scores.
The graph on the left is for Random Forest and the one on the right is for XGBoost. For both, the three most important features were (see the extraction sketch after this list):
Zipcode – where the restaurant is located,
Inspection month – when the inspection took place,
Cuisine type – what the restaurant specializes in.
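Both ensembles expose impurity-based importances after fitting. A sketch of how they can be pulled out and plotted, where `feature_names` is a hypothetical placeholder for the list of training column names:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Side-by-side top-10 importances from the fitted models
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, model, name in [(axes[0], rf, "Random Forest"), (axes[1], xgb, "XGBoost")]:
    imp = pd.Series(model.feature_importances_, index=feature_names)
    imp.nlargest(10).plot(kind="barh", ax=ax, title=name)
plt.show()
```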
Hypothesis Testing
We chose a significance level of 1% (alpha = 0.01) to be conservative in identifying features that truly influence inspection grades. This stricter threshold reduces the likelihood of false positives, meaning we are less likely to incorrectly conclude that a feature is important when it is not. Since our dataset is large, even small effects can appear statistically significant at higher levels, so using 1% helps us focus only on the most robust and meaningful relationships between features and grades.
For categorical features:
Null hypothesis (H_0): the categorical feature is independent of GRADE.
Alternative hypothesis (H_1): the categorical feature is associated with GRADE.
For numerical features:
Null hypothesis (H_0): the means of the numerical feature are the same across the GRADE groups.
Alternative hypothesis (H_1): at least one group mean is different.
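These hypotheses correspond to standard tests: a chi-square test of independence for the categorical case and a one-way ANOVA for the numerical case. A sketch with scipy, using one feature of each kind as an example (`SCORE` is the dataset's numeric inspection score column):

```python
import numpy as np
import pandas as pd
from scipy import stats

alpha = 0.01

# Categorical feature vs GRADE: chi-square test of independence
table = pd.crosstab(df["cuisine_group"], df["GRADE"])
chi2, p_cat, dof, _ = stats.chi2_contingency(table)

# Numerical feature across GRADE groups: one-way ANOVA
groups = [g["SCORE"].dropna() for _, g in df.groupby("GRADE")]
f_stat, p_num = stats.f_oneway(*groups)

# Same comparison as the figure: is -log10(p) above the -log10(alpha) line?
for name, p in [("cuisine_group", p_cat), ("SCORE", p_num)]:
    print(name, -np.log10(max(p, 1e-300)) > -np.log10(alpha))
```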
As shown in the figure, the negative logarithm of each feature’s p-value exceeds the −log(0.01) threshold, indicating that all features are statistically significant predictors of grade. Accordingly, we reject all null hypotheses.
Implications
Our project offers a blueprint for improving restaurant safety in NYC by delivering actionable insights to multiple stakeholders. For inspectors, our models can help prioritize high-risk restaurants for targeted or surprise inspections. For restaurant owners, the analysis highlights specific violation prefixes that pose the greatest risk to their final letter grade, enabling more focused compliance efforts. For consumers, our findings increase transparency by providing a clearer view into the consistency and reliability of their favorite dining establishments.