Chris IJ Hwang

I am a Quantitative Analyst/Developer and Data Scientist with backgroud of Finance, Education, and IT industry. This site contains some exercises, projects, and studies that I have worked on. If you have any questions, feel free to contact me at ih138 at columbia dot edu.

View My GitHub Profile



Contents

Integrating R and Tableau Case 1: Principal Component Analysis

Performed Principal Component Analysis of High School students performances of 3 major subjects in Maryland State 2014.

Data

Variables:

School Name
Algebra Adv Pct
Algebra Proficient Pct
Algebra Basic Pct
English Adv Pct
English Proficient Pct
English Basic Pct
Biology Adv Pct
Biology Pforiceint Pct
Biology Basic Pct

###Data Sources [Maryland State][1] [1]: http://www.mdreportcard.org
Files: Algebra_2014.csv
Biology_2014.csv
English2_2014.csv

Steps

Environment Setting

At /etc/Rserv.conf

remote enable
> Rserve(args ="--no-save")

Data Cleansing

  alg = read.csv('files/algebra_2014.csv', header=T)
  names(alg) # Check all names
  keeps = c("School.Name","Advanced.Pct", "Proficient.Pct", "Basic.Pct" ) # The colomuns to keep
  alg = alg[keeps]
  names(alg)[2]= "alg.Advanced.Pct"
  names(alg)[3] = "alg.Proficient.Pct"
  names(alg)[4] = "alg.Basic.Pct"
  alg$School.Name = as.character(alg$School.Name)
  alg$School.Number = as.numeric(as.character(alg$School.Number))
  alg$School.Name = paste(alg$School.Name, alg$School.Number, sep=".")
  alg$School.Name = as.factor(alg$School.Name)
  alg=alg[,-1]
  alg_df = subset(alg, alg.Advanced.Pct!="*" )
  eng_df = subset(eng, eng.Advanced.Pct!="*" )
  bio_df = subset(bio, bio.Advanced.Pct!="*" )
  # English
  for (i in 2:dim(eng_df)[2]) {
    eng_df[,i] = as.character(eng_df[,i])
    eng_df[,i][eng_df[,i]=="<= 5.0"] = "5.0"
    eng_df[,i] = as.factor(eng_df[,i])
  }
  total = merge(alg_df, eng_df, by="School.Name")  
  total = merge(total, bio_df, by="School.Name")  

  write.table(total, file='total.csv', sep=',')  

Analysis and Visualization

pc.cr <- princomp(df, cor=TRUE) # PCA 
pc.cr # print PCA object
summary(pc.cr) # summarize results
loadings(pc.cr) # list loadings
res =pc.cr$sdev^2/sum(pc.cr$sdev^2)
scores = pc.cr$score
as.numeric(scores[,1])

Tableau will call R with SCRIPT_*() at “calculated Fields” modifying the above R code as below:

SCRIPT_REAL(
"n <- max(.arg10)
df <- data.frame(.arg1,.arg2,.arg3,.arg4, .arg5, .arg6, .arg7, .arg8, .arg9)
pc <- princomp(df, cor = TRUE) 
as.numeric(pc$scores[,n])
", 
SUM([alg.Advanced.Pct]),SUM([alg.Proficient.Pct]),SUM([alg.Basic.Pct]),
SUM([bio.Advanced.Pct]), SUM([bio.Proficient.Pct]), SUM([bio.Basic.Pct]),
SUM([eng.Advanced.Pct]), SUM([eng.Proficient.Pct]), SUM([eng.Basic.Pct]),[Component x]
)

Overall look is as below:

alt text

alt text

You can donwload the workbook at my github page or from direct link in the upper-left side.

References:

http://kb.tableausoftware.com/articles/knowledgebase/r-implementation-notes http://www.tableausoftware.com/about/blog/2013/10/tableau-81-and-r-25327 http://www.r-bloggers.com/dream-team-combining-tableau-and-r/ http://www.tableausoftware.com/learn/whitepapers/tableau-and-r-faq