Saurabh Kulkarni - Data Analysis using R

These case studies were performed for the course Data Analysis using R, offered by the Math and Statistics Department at UCSD. Each project was done in a group of 2 or 3. Through these projects I implemented a typical data analysis workflow: cleaning the data, visualizing it, choosing features, modeling, and finally interpreting the results. R was used to implement the various statistical modeling methods. Links to each R notebook/script file and project report are provided below.

List of Case studies:

Recommendations for Calibration of Snow Gauges
Searching for potential replication sites in Human Cytomegalovirus (HCMV) DNA data
Explore effect of maternal smoking on infant birth weights

Recommendations for Calibration of Snow Gauges

Project Links

Calibration of Snow Gauges Code

Project Report

Project Details

Aim: Find the relationship between density and gain, and evaluate how well linear regression models it.

Description: This data was provided by the USDA from one of their calibration sessions in the Sierra Nevada. In the gauge model, blocks of polyethylene with known densities are placed between the emitting rod and the receiver rod of the gauge. As gamma rays pass through a polyethylene block, a ray may be absorbed, bounce away from the receiver rod, or pass through the block to the receiver rod. For each block, 30 measurements are taken, and the middle 10 are reported in the dataset; the measurement from the gauge is called the gain. There were 9 blocks in total.

Here, I examined how well simple models, namely linear and log-linear regression, fit the data. Various visualizations illustrate their performance. The aim was not to find the best-fitting model but to evaluate the linear regression model (with and without a log transformation) and to see which diagnostic checks can be used to visualize that performance. For the detailed analysis, check out the R notebook on my GitHub page.
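The comparison between the plain linear fit and the log-transformed fit can be sketched as follows. This is a minimal illustration, not the code from the project notebook: the data are simulated (the density values, the exponential form, and the noise level are all assumptions), and the names gauge, fit_linear, and fit_log are hypothetical.

```r
# Simulated stand-in for the gauge data: 9 blocks x 10 readings each.
# The densities and the exponential relation are illustrative assumptions,
# NOT the USDA measurements.
set.seed(1)
density <- rep(seq(0.05, 0.70, length.out = 9), each = 10)
gain <- 400 * exp(-4.5 * density) * exp(rnorm(length(density), sd = 0.05))
gauge <- data.frame(density = density, gain = gain)

# Model 1: gain as a linear function of density
fit_linear <- lm(gain ~ density, data = gauge)

# Model 2: log(gain) as a linear function of density
fit_log <- lm(log(gain) ~ density, data = gauge)

# Compare both models on the original gain scale using RMSE.
# Predictions from the log model are back-transformed with exp().
rmse_linear <- sqrt(mean(residuals(fit_linear)^2))
rmse_log <- sqrt(mean((gauge$gain - exp(fitted(fit_log)))^2))

print(c(linear = rmse_linear, log_linear = rmse_log))
print(coef(fit_log))
```

On real data, plot(fit_linear, which = 1) and plot(fit_log, which = 1) draw residuals against fitted values; a curved pattern in the first plot that disappears in the second would suggest the log transformation is appropriate.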

Searching for potential replication sites in Human Cytomegalovirus (HCMV) DNA data

Project Links

HCMV Code

HCMV Report

Project Details

Aim: Identify patterns in the DNA data (229,354 base pairs long) that could be potential DNA replication sites in Human Cytomegalovirus.

Description: The Human Cytomegalovirus (HCMV) can cause potentially life-threatening disease. HCMV is difficult to detect because the virus remains dormant until a critical mass is achieved through reproduction. To combat the virus, virologists need to isolate the origin of replication so that the reproductive cycle of the virus may be interrupted. Researchers believe that palindrome sites in the DNA might be the locations where replication begins. However, some palindromes may also occur by chance.

Hence, we seek to identify whether certain clusters of palindromes in the HCMV DNA data are statistically different from the other clusters. In other words, clusters of palindromes that are unlikely to occur by random chance, given the distribution of the clusters, may signify the location of the origin of replication. Researchers working on HCMV should begin their search within these clusters to save time.
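One way to check whether a cluster is denser than chance would allow is to count palindromes in equal-width intervals and compare the largest count against a Poisson model of uniform random scatter. The sketch below assumes that approach; it is not necessarily the method used in the report. The palindrome positions are simulated, and n_sites and n_bins are hypothetical values chosen only for illustration. Only the genome length comes from the project description.

```r
set.seed(42)
genome_length <- 229354   # base pairs, from the project description
n_sites <- 300            # hypothetical number of palindromes (assumption)
n_bins <- 50              # hypothetical number of intervals (assumption)

# Simulated palindrome positions scattered uniformly (the null model).
# Real positions would be read from the HCMV dataset instead.
positions <- sort(sample.int(genome_length, n_sites))

# Count palindromes in equal-width intervals along the genome
breaks <- seq(0, genome_length, length.out = n_bins + 1)
counts <- as.vector(table(cut(positions, breaks = breaks)))

# Under uniform random scatter, each interval count is approximately
# Poisson with this mean
lambda <- n_sites / n_bins

# Probability that the largest of n_bins Poisson counts is at least as
# large as the observed maximum (treating intervals as independent)
max_count <- max(counts)
p_max <- 1 - ppois(max_count - 1, lambda)^n_bins

print(c(max_count = max_count, p_max = p_max))
```

A small p_max on the real positions would flag the densest interval as a candidate region worth examining. Because the result depends on the interval width, it is worth repeating the count with several values of n_bins.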

Explore effect of maternal smoking on infant birth weights

Project Details

Aim: This case study uses data to determine whether maternal smoking adversely affects infant birth weights. The confounding variables that could mislead the analysis were also explored.

Description: This was part of a course project that relied on extensive visualization, such as histograms, boxplots, and QQ plots, to illustrate whether or not smoking affects birth weights. The dataset allowed exploration of possible confounding variables, such as the family’s income and the mother’s education, and whether they influence the health of the newborn. This exercise explored the correlation vs. causation argument often encountered in data science case studies.
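The group comparison and the confounder check can be sketched in a few lines of R. This is an illustration under stated assumptions, not the project's analysis: the data frame babies is simulated, the effect sizes and units are made up, and the column names bwt, smoke, and income are hypothetical.

```r
set.seed(7)
n <- 1000

# Simulated data with made-up effect sizes (NOT the course dataset)
smoke_flag <- rbinom(n, 1, 0.4)   # 1 = mother smoked
income <- rnorm(n)                # standardized family income
bwt <- 120 - 8 * smoke_flag + 2 * income + rnorm(n, sd = 17)

babies <- data.frame(
  bwt = bwt,
  smoke = factor(smoke_flag, labels = c("no", "yes")),
  income = income
)

# Welch two-sample t-test: do mean birth weights differ by smoking status?
tt <- t.test(bwt ~ smoke, data = babies)
print(tt$estimate)
print(tt$p.value)

# Adjust for a possible confounder by adding it to a linear model;
# the smoking coefficient is then the difference at fixed income
fit <- lm(bwt ~ smoke + income, data = babies)
print(coef(fit))
```

For the visual checks mentioned above, boxplot(bwt ~ smoke, data = babies) compares the two groups side by side, and qqnorm(babies$bwt) followed by qqline(babies$bwt) checks how close the birth weights are to a normal distribution. A smoking coefficient that stays similar after adding covariates supports an association but does not by itself establish causation.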

Project Details

Date: Mar 16, 2016

Categories: project

Tags: R programming, visualizations, Regression, Project Reports

Website: https://github.com/saurabhkulkarni2312/R-Projects/tree/master/Calibrating-Snow-Gauges-Regression
