Data Modelling Portfolio

This is a list of my projects:

  • House Prices: Advanced Techniques

Predict sales prices and practice feature engineering, RFs, and gradient boosting.

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.

The link to my notebook:

  • Red Wine Quality

Simple and clean practice dataset for regression or classification modelling.

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones).

This dataset is also available from the UCI machine learning repository , https://archive.ics.uci.edu/ml/datasets/wine+quality .

The link to my notebook:

  • Used Cars Dataset

Vehicles listings from Craigslist.org

Craigslist is the world’s largest collection of used vehicles for sale, yet it’s very difficult to collect all of them in the same place. I built a scraper for a school project and expanded upon it later to create this dataset which includes every used vehicle entry within the United States on Craigslist. This data is scraped every few months, it contains most all relevant information that Craigslist provides on car sales including columns like price, condition, manufacturer, latitude/longitude, and 18 other categories. For ML projects, consider feature engineering on location columns such as long/lat. For previous listings, check older versions of the dataset.

The link to my notebook:

  • Lending Club Loan Data

Analyze Lending Club’s issued loans

Lending Club is a peer to peer lending company based in the United States, in which investors provide funds for potential borrowers and investors earn a profit depending on the risk they take (the borrowers credit score). Lending Club provides the “bridge” between investors and borrowers. For more basic information about the company please check out the wikipedia article about the company.

This Project contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the “present” contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.

The link to my notebook: