Boston Housing Dataset, its origins, and its relevance to Real Estate Analysis

Introduction: Introduce the Boston housing dataset, its origins, and its relevance to real estate analysis and machine learning.

Data Acquisition:

Describe how the dataset was sourced from a public statistical repository and loaded into a Python environment using Pandas and requests.
Provide an overview of the dataset's structure and initial exploration.
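The loading step could be sketched roughly as follows. This is a minimal sketch, not the project's actual loader. The URL (the StatLib copy of the data), the 22-line preamble, and the helper names parse_boston and fetch_boston are assumptions made for illustration. The parser relies only on each record holding 14 whitespace-separated values, possibly wrapped over two lines.

```python
import pandas as pd

# Column names follow the feature list used in this write-up.
COLUMNS = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV",
]

# Assumption: the StatLib copy of the data; the write-up does not name its source.
STATLIB_URL = "http://lib.stat.cmu.edu/datasets/boston"


def parse_boston(text, skiprows=0):
    """Turn whitespace-delimited text into a DataFrame with 14 named columns."""
    lines = text.splitlines()[skiprows:]
    # Records may wrap across lines, so flatten everything into one token stream.
    values = [float(token) for token in " ".join(lines).split()]
    width = len(COLUMNS)
    if len(values) % width != 0:
        raise ValueError("number of values is not a multiple of 14")
    rows = [values[i:i + width] for i in range(0, len(values), width)]
    return pd.DataFrame(rows, columns=COLUMNS)


def fetch_boston(url=STATLIB_URL, skiprows=22, timeout=30):
    """Download the raw text with requests and parse it (needs network access)."""
    import requests  # imported here so parse_boston works without requests installed

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    # Assumption: the file starts with a 22-line descriptive preamble.
    return parse_boston(response.text, skiprows=skiprows)


# Offline demonstration on two records laid out in the wrapped format.
SAMPLE = """header line to skip
 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30
  396.90   4.98  24.00
 0.02731   0.00   7.070  0  0.4690  6.4210  78.90  4.9671   2  242.0  17.80
  396.90   9.14  21.60
"""

df = parse_boston(SAMPLE, skiprows=1)
print(df.shape)
print(df.head())
```

Calling fetch_boston() would run the same parser on the downloaded file. After that, df.info() and df.describe() are the usual first steps for an overview of the structure.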

Data Cleaning and Preparation:

Detail the steps taken to clean the dataset, including handling missing values, converting data types, and ensuring consistency across columns.
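One possible shape for these cleaning steps is shown below. It is a sketch under stated assumptions: median imputation, dropping exact duplicate rows, and treating CHAS and RAD as integer columns are illustrative choices, and clean_housing is a hypothetical helper name, not something taken from the project.

```python
import pandas as pd


def clean_housing(df):
    """Return a cleaned copy: consistent names, numeric types, no missing values."""
    cleaned = df.copy()
    # Consistent column names: strip stray whitespace and use upper case.
    cleaned.columns = [str(col).strip().upper() for col in cleaned.columns]
    # Convert every column to a numeric type; unparseable entries become NaN.
    for col in cleaned.columns:
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    # Drop exact duplicate rows (assumption: duplicates are data-entry repeats).
    cleaned = cleaned.drop_duplicates()
    # Fill remaining gaps with the column median (one simple imputation choice).
    cleaned = cleaned.fillna(cleaned.median(numeric_only=True))
    # CHAS is a 0/1 dummy and RAD an index, so store both as integers.
    for col in ("CHAS", "RAD"):
        if col in cleaned.columns:
            cleaned[col] = cleaned[col].round().astype(int)
    return cleaned.reset_index(drop=True)


# Small made-up frame with a messy header, a text "NA", a gap, and a duplicate row.
raw = pd.DataFrame({
    " crim": ["0.1", "NA", "0.3", "0.3"],
    "CHAS": [0, 1, 0, 0],
    "rm": [6.0, None, 7.0, 7.0],
})
cleaned = clean_housing(raw)
print(cleaned)
print(cleaned.dtypes)
```

Median imputation is used here because it is robust to skewed columns such as CRIM. Dropping incomplete rows instead would be equally defensible if few values are missing.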

Exploratory Data Analysis (EDA):

Discuss the exploratory analysis conducted on key features:
- Crime rate per capita (CRIM)
- Proportion of residential land zoned for lots over 25,000 sq.ft. (ZN)
- Proportion of non-retail business acres per town (INDUS)
- Charles River dummy variable: 1 if the tract bounds the river, 0 otherwise (CHAS)
- Nitric oxides concentration in parts per 10 million (NOX)
- Average number of rooms per dwelling (RM)
- Proportion of owner-occupied units built prior to 1940 (AGE)
- Weighted distances to employment centres (DIS)
- Index of accessibility to radial highways (RAD)
- Full-value property-tax rate per $10,000 (TAX)
- Pupil-teacher ratio by town (PTRATIO)
- 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town (B)
- % lower status of the population (LSTAT)
- Median value of owner-occupied homes in $1000's (MEDV)
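A first pass over the features listed above might compute summary statistics and each feature's correlation with MEDV. The sketch below runs on synthetic stand-in data, so the numbers it prints say nothing about the real dataset. The helper name summarize_features and the three-column demo frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def summarize_features(df, target="MEDV"):
    """Return (summary table, correlations with the target sorted by strength)."""
    summary = df.describe().T[["mean", "std", "min", "max"]]
    correlations = (
        df.corr()[target]
        .drop(target)
        .sort_values(key=lambda s: s.abs(), ascending=False)
    )
    return summary, correlations


# Synthetic stand-in data: NOT the real Boston values.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, n),
    "LSTAT": rng.uniform(2, 35, n),
})
# Made-up relationship: price rises with rooms and falls with LSTAT.
demo["MEDV"] = 5.0 * demo["RM"] - 0.5 * demo["LSTAT"] + rng.normal(0, 2.0, n)

summary, correlations = summarize_features(demo)
print(summary)
print(correlations)
```

On the real data the same call would take the full 14-column frame. Sorting by absolute correlation puts the strongest linear relationships first regardless of sign.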

Data Visualization:

Showcase histograms and plots that illustrate the distributions and correlations of these features.
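Histograms of this kind could be produced with Matplotlib along the following lines. This is a sketch that assumes Matplotlib is installed. The helper name plot_histograms, the bin count, the output file name, and the synthetic demo columns are illustrative choices rather than the project's actual plotting code.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_histograms(df, columns, bins=20):
    """Draw one histogram per column, side by side, and return the figure."""
    fig, axes = plt.subplots(
        1, len(columns), figsize=(4 * len(columns), 3), squeeze=False
    )
    for ax, col in zip(axes[0], columns):
        ax.hist(df[col].dropna(), bins=bins, edgecolor="black")
        ax.set_title(col)
        ax.set_xlabel(col)
        ax.set_ylabel("Frequency")
    fig.tight_layout()
    return fig


# Synthetic stand-in data: NOT the real Boston values.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "DIS": rng.gamma(2.0, 2.0, 300),
    "PTRATIO": rng.normal(18.5, 2.0, 300),
})

fig = plot_histograms(demo, ["DIS", "PTRATIO"])
fig.savefig("histograms.png", dpi=100)  # hypothetical output file name
```

For correlations, the same frame can be passed through demo.corr() and the resulting matrix shown with ax.imshow or as a scatter plot of a feature against MEDV.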

Machine Learning:

Implement a linear regression model using Scikit-learn to predict housing prices based on selected features.
Discuss model evaluation metrics such as mean squared error (MSE) and coefficient of determination (R²).
Optional: Explore polynomial regression or feature engineering to improve model performance.
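The modelling step might look like the sketch below, which fits a linear regression with Scikit-learn, reports MSE and R² on a held-out split, and reuses the same function for the optional polynomial variant. The feature choice (RM and LSTAT), the 80/20 split, the helper name fit_and_score, and the synthetic data are all assumptions, so the printed scores are not results for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def fit_and_score(df, features, target="MEDV", degree=1, seed=0):
    """Fit a (polynomial) linear regression and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    # degree=1 is plain linear regression; degree=2 adds squares and interactions.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "mse": mean_squared_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }


# Synthetic stand-in data: NOT the real Boston values.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, n),
    "LSTAT": rng.uniform(2, 35, n),
})
demo["MEDV"] = 5.0 * demo["RM"] - 0.5 * demo["LSTAT"] + rng.normal(0, 1.0, n)

linear = fit_and_score(demo, ["RM", "LSTAT"], degree=1)
quadratic = fit_and_score(demo, ["RM", "LSTAT"], degree=2)
print("linear:   ", linear)
print("quadratic:", quadratic)
```

Scoring on a held-out split rather than the training data keeps the comparison honest: a higher-degree model always fits the training set at least as well, so only test-set MSE and R² indicate whether the extra terms help.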

Findings:

In the first graph below, the y-axis shows frequency. Most observations fall at short distances from employment centres, which suggests that the majority of people chose to live close to their place of work.

In the second graph below, the y-axis refers to the pupil-teacher ratio (PTRATIO), that is, the number of students per teacher in each town.