IBM Data Science Professional Certificate, Capstone Project: The best locations where it pays to open a restaurant in New York

Ahmed Midingoyi
May 10, 2021

Introduction/Business Problem

The restaurant business is one of the most profitable sectors. However, according to one study, 60 percent of restaurants close or change owners within the first year of operation, and 80 percent fail within five years. Restaurants usually fail because of a combination of problems that eventually leads to their closure, and a bad location is one of the biggest reasons. For example, a restaurant can sell the best “burger” in the world, but if it is in a poor location (hidden, sparsely inhabited, hard to see and difficult to access), it will have to put much more effort into attracting customers than into serving them.

In this context, how do we define the best locations where it pays to open a restaurant?

Our objective is to recommend the best locations in New York City (well inhabited, close to subway stations, distant from existing restaurants) to open a restaurant. We do not distinguish between kinds of restaurant.

The purpose of this exercise is to submit the final capstone project for the “IBM Data Science” course on Coursera and to showcase my data science skills in a real-world application.

Project Data Source

The data set required for this project comes from four different data sources:

  • Coordinates of the boundaries of Neighborhood Tabulation Areas (NTA) in New York from https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page
  • Population Numbers By New York City Neighborhood Tabulation Areas (NTA) from https://data.cityofnewyork.us.
  • Location data of New York City subway stations from https://data.cityofnewyork.us. It allows us to determine the minimal distance from an NTA to a subway station and the number of subway stations located within a given radius.
  • Location data of restaurants from the Foursquare API. It helps to determine the minimal distance from an NTA to a restaurant and the number of restaurants located within a given radius.

These data required heavy pre-processing to convert them into a working set suitable for the machine learning algorithms and visualization operations that were applied to it.

So, we generate a dataframe with one row per NTA and the following columns:

  • longitude and latitude
  • population
  • minimal distance from a neighborhood location to a subway station
  • number of subway stations located in a given radius
  • minimal distance from a restaurant to a neighborhood location
  • number of restaurants located in a given radius

The best locations are those with few or no restaurants, close to subway stations and well inhabited. NTA boundaries and their associated names may not definitively represent neighborhoods, so in this exercise we use the center of each NTA as a “district”.
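The two kinds of distance-based columns (minimal distance, and count within a given radius) can be sketched as follows. This is a minimal illustration in plain Python, assuming great-circle (haversine) distance; the function names `haversine_km` and `proximity_features` are hypothetical, the coordinates are made up, and the distance measure used in the actual notebook may differ.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def proximity_features(center, points, radius_km=1.0):
    """Return (minimal distance in km, number of points within radius_km) for one center."""
    distances = [haversine_km(center[0], center[1], lat, lon) for lat, lon in points]
    if not distances:
        return float("inf"), 0
    return min(distances), sum(d <= radius_km for d in distances)


# Illustrative coordinates only (approximate, not taken from the project data).
district = (40.7580, -73.9855)
stations = [(40.7557, -73.9870), (40.7527, -73.9772), (40.7061, -74.0087)]

min_dist, n_close = proximity_features(district, stations, radius_km=1.0)
print(f"nearest station: {min_dist:.2f} km, stations within 1 km: {n_close}")
```

The same helper can be applied to the restaurant locations to fill the two restaurant columns.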

Methodology

  1. Data pre-processing
  • Download the data sets and load them into a dataframe. NTAs are Polygons or MultiPolygons; our strategy is to determine the centroid of each of these features, which is named “District”
  • Create a function that determines the coordinates of the Point representing the center of an NTA
  • Pre-process the subway stations data set by extracting the coordinates of the stations
  • Create a function that calculates the minimal distance from an NTA to a restaurant and to a subway station
  • Extract the nearest restaurants within a radius of 1 km
  • Feature selection: all the dataframes are combined into a final dataframe to which the classification algorithm can be applied
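The centroid step for Polygons and MultiPolygons can be illustrated with the shoelace formula. This is a plain-Python sketch, not the notebook's code: the helper names `ring_centroid` and `district_center` are hypothetical, holes are ignored, and longitude/latitude are treated as planar coordinates, which is only an approximation at the scale of a neighborhood. A geometry library such as Shapely exposes a `centroid` property that does this work directly.

```python
def ring_centroid(ring):
    """Signed area and centroid of one ring of (x, y) vertices (shoelace formula)."""
    area2 = cx = cy = 0.0  # area2 accumulates twice the signed area
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate ring with zero area")
    return area2 / 2.0, (cx / (3.0 * area2), cy / (3.0 * area2))


def district_center(rings):
    """Area-weighted centroid of exterior rings: one ring for a Polygon, several for a MultiPolygon."""
    total = sx = sy = 0.0
    for ring in rings:
        area, (cx, cy) = ring_centroid(ring)
        weight = abs(area)  # vertex order (clockwise or not) must not matter
        total += weight
        sx += weight * cx
        sy += weight * cy
    return sx / total, sy / total


# Made-up (longitude, latitude) rings standing in for one NTA with two parts.
nta_rings = [
    [(-73.99, 40.75), (-73.97, 40.75), (-73.97, 40.77), (-73.99, 40.77)],
    [(-73.96, 40.75), (-73.95, 40.75), (-73.95, 40.76), (-73.96, 40.76)],
]
lon, lat = district_center(nta_rings)
print(f"district center: lon={lon:.4f}, lat={lat:.4f}")
```

Weighting each part by its area keeps a small detached piece of an NTA from pulling the district center too far from the main part.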

2. Visualization

  • Use the Folium library to visualize the locations of NTAs across New York and within a particular borough

[Figure: Centroids of Neighborhood Tabulation Areas in New York]

[Figure: Centroids of Neighborhood Tabulation Areas in Manhattan borough]

3. Classification algorithm

  • We apply the KMeans clustering algorithm to the final dataframe. We obtain 4 clusters, 2 of which give the best places to open new restaurants.
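To make the clustering step concrete, here is a plain-Python sketch of the K-Means idea (Lloyd's algorithm) with 4 clusters. It is not the notebook's code: the post does not show the actual KMeans call, whether the features were scaled, or how the 2 best clusters were picked, so the z-score scaling, the helper names `standardize` and `kmeans`, and the made-up rows (population, minimal subway distance in km, restaurants within the radius) are all assumptions for illustration.

```python
import random
from math import sqrt


def standardize(rows):
    """Rescale each column to zero mean and unit variance so no feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has standard deviation 0; fall back to 1.0 to avoid dividing by zero.
    stds = [sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0 for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]  # k distinct rows as initial centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: each row goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers


# Made-up rows: [population, minimal subway distance (km), restaurants within radius].
rows = [
    [82000, 0.2, 45], [79000, 0.3, 5], [15000, 1.8, 2], [12000, 2.1, 1],
    [60000, 0.4, 60], [58000, 0.5, 8], [30000, 1.2, 20], [28000, 1.0, 25],
]
labels, centers = kmeans(standardize(rows), k=4)
print(labels)
```

One way to read the result is to compare the cluster centers: clusters whose centers combine high population, short subway distance and few restaurants would match the criteria stated earlier for a good location.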

We could repeat the process for each borough.

To learn more, here is a link to my code on GitHub: https://github.com/AhmcyrCourse/Coursera_Capstone/blob/master/Report.ipynb

Here is a link to my final quick summary report where I discuss my results: https://github.com/AhmcyrCourse/Coursera_Capstone/blob/master/Report.md
