Karl Ho
School of Economic, Political and Policy Sciences
University of Texas at Dallas
2025 Institute for Social Science Methodology
The two-part workshop is designed to introduce students to data science and its applications. Each course is delivered in three hours and gives an overview of the subfields of data science, with illustrations and hands-on practice. Students should follow the pre-class instructions to prepare materials and bring their own device to class.
Please fill out this survey
This introductory course is an overview of Data Science. Students will learn:
Recommended software and IDEs
Cloud websites/accounts:
Optional software and IDEs:
Text editor of your choice (e.g. Visual Studio Code, Sublime Text, Brackets)
McKinsey & Co., An Executive’s Guide to AI
Hugo Bowne-Anderson. 2019. "What 300 L&D leaders have learned about building data fluency"
Data fluency
Everybody has the data skills and literacy to understand data-driven documents and perform data-driven tasks
Danger of immature data fluency
Introduction - Data theory
Data methods
Statistics
Programming
Data Visualization
Information Management
Data Curation
Spatial Models and Methods
Machine Learning
NLP/Text mining
"facts and statistics collected together for reference or analysis"
- Oxford dictionary
"the representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means"
- McGraw-Hill Dictionary of Scientific and Technical Terms, 2003
"a reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing"
- ISO/IEC 2382-1:1993
"a set of values of qualitative or quantitative variables"
- Mark A. Beyer, 2014
the basis for:
Beyer, M. A. 2014. "Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data." Gartner Research.
metadata, paradata
Hilbert, M. and López, P., 2011. The world's technological capacity to store, communicate, and compute information. Science, p.1200970.
Bits: 8 bits = 1 byte
Bytes: 1024 bytes = 1 KB (1 to 3 digits)
Kilobytes: 1024 KB = 1 MB (4 to 6 digits)
Megabytes: 1024 MB = 1 GB (7 to 9 digits)
Gigabytes: 1024 GB = 1 TB (10 to 12 digits)
Terabytes: 1024 TB = 1 PB (13 to 15 digits)
Petabytes: 1024 PB = 1 EB (16 to 18 digits)
Exabytes: 1024 EB = 1 ZB (19 to 21 digits)
Zettabytes: 1024 ZB = 1 YB (22 to 24 digits)
Yottabytes: more than enough... (25 to 27 digits)
- InsideBigdata.com
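The unit ladder above can be sketched in R as a small conversion helper. This is a minimal illustration that assumes the 1024-based steps listed on the slide; the function name `format_bytes` is hypothetical and not part of any package.

# Convert a raw byte count to the largest unit on the 1024-based ladder.
# `format_bytes` is a hypothetical helper written for this illustration.
format_bytes <- function(bytes) {
  units <- c("bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
  i <- 1
  # Divide by 1024 until the value drops below 1024 or units run out
  while (bytes >= 1024 && i < length(units)) {
    bytes <- bytes / 1024
    i <- i + 1
  }
  paste(round(bytes, 2), units[i])
}

format_bytes(1024)         # "1 KB"
format_bytes(5 * 1024^3)   # "5 GB"

A loop is used instead of a logarithm so that exact powers of 1024 are not misplaced by floating-point rounding.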
CISE - Computer and Information Science and Engineering
ENG - Engineering
SBE - Social, Behavioral and Economic Sciences
MPS - Mathematical and Physical Sciences
Grimmer, J., 2015. We are all social scientists now: how big data, machine learning, and causal inference work together. PS: Political Science & Politics, 48(1), pp.80-83.
Science of Data
Understand Data Scientifically
CRMs
One assumes that the data are generated by a given stochastic data model.
The other uses algorithmic models and treats the data mechanism as unknown.
Data Model
Algorithmic Model
Small data
Complex, big data
Data are generated in many ways. Picture this: independent variable x goes in one side of the box (we call it nature for now) and dependent variable y comes out the other side.
The analysis in this culture starts by assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from:
Response variable = f(predictor variables, random noise, parameters)
Read: the response variable is a function of a series of predictor (independent) variables, plus random noise (normally distributed errors) and other parameters.
The values of the parameters are estimated from the data, and the model is then used for information and/or prediction.
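The data modeling culture can be illustrated with a short simulation. This is a sketch under stated assumptions: the linear form, the parameter values (2 and 0.5), and the noise level are chosen purely for illustration and do not come from the slides.

# Data model: response = f(predictors, random noise, parameters)
set.seed(42)                            # make the simulated draws reproducible
n <- 200
x <- runif(n, 0, 10)                    # predictor variable
noise <- rnorm(n, mean = 0, sd = 1)     # normally distributed errors
beta0 <- 2                              # illustrative "true" intercept (assumption)
beta1 <- 0.5                            # illustrative "true" slope (assumption)
y <- beta0 + beta1 * x + noise          # nature's box, assumed to be linear

fit <- lm(y ~ x)                        # estimate the parameters from the data
coef(fit)                               # estimates should be near 2 and 0.5
predict(fit, newdata = data.frame(x = 4))   # use the fitted model for prediction

Because the assumed model matches how the data were generated, the estimated coefficients recover the chosen parameters up to sampling error.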
The analysis in this culture considers the inside of the box complex and unknown. The approach is to find a function f(x), an algorithm that operates on x to predict the responses y.
The goal is to find an algorithm that accurately predicts y.
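One way to illustrate the algorithmic culture is k-nearest-neighbour regression, which makes no assumption about the form of the box and is judged only by how well it predicts held-out data. This is a minimal sketch: the helper `knn_predict`, the simulated sine-shaped mechanism, and the choice of k = 10 are all assumptions made for the example.

# Algorithmic model: predict y at a new x as the mean of the k nearest training responses.
# `knn_predict` is a hypothetical helper written for this illustration.
knn_predict <- function(x_train, y_train, x_new, k = 5) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]   # indices of the k closest points
    mean(y_train[nearest])
  })
}

set.seed(1)
x <- runif(300, 0, 2 * pi)
y <- sin(x) + rnorm(300, sd = 0.2)      # nonlinear mechanism, treated as unknown

train <- 1:200                          # fit on the first 200 observations
test <- 201:300                         # evaluate on the remaining 100

pred_knn <- knn_predict(x[train], y[train], x[test], k = 10)
rmse_knn <- sqrt(mean((y[test] - pred_knn)^2))

# A straight-line data model on the same split, for comparison
fit_lm <- lm(y ~ x, data = data.frame(x = x[train], y = y[train]))
pred_lm <- predict(fit_lm, newdata = data.frame(x = x[test]))
rmse_lm <- sqrt(mean((y[test] - pred_lm)^2))

c(knn = rmse_knn, linear = rmse_lm)     # lower is better

On data like these, the nearest-neighbour predictor should show a lower out-of-sample error than the misspecified linear model, which is the criterion this culture cares about.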
Supervised Learning vs. Unsupervised Learning
Source: https://www.mathworks.com
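The contrast between the two kinds of learning can be sketched with R's built-in iris data. The choice of dataset, variables, and three clusters is an assumption made for illustration only.

# Supervised learning: an observed outcome (Petal.Width) guides the fit
data(iris)
sup <- lm(Petal.Width ~ Petal.Length, data = iris)
summary(sup)$r.squared                  # how well the outcome is predicted

# Unsupervised learning: no outcome is used; groups are sought in the measurements alone
set.seed(123)                           # k-means starts from random centers
unsup <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Species labels are used only afterwards, to inspect what the clusters found
table(cluster = unsup$cluster, species = iris$Species)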
- Hugo Bowne-Anderson
Data programming
- Maribel Fernandez 2014
# Create preload function
# Check if a package is installed.
# If yes, load the library
# If no, install package and load the library
preload <- function(x) {
  x <- as.character(x)
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}
learning how to program can significantly enhance how social scientists can think about their studies, and especially those premised on the collection and analysis of digital data.
- Brooker 2019:
Chances are the language you learn today will quite likely not be the language you'll be using tomorrow.
- Venables, Smith and the R Core team
Source: Nick Thieme. 2018. R Generation: 25 years of R https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2018.01169.x
The script window:
You can store a document of commands you used in R to reference later or repeat analyses
Environment:
Lists all of the objects
Console:
Output appears here. The > sign means R is ready to accept commands.
Plot/Help:
Plots appear in this window. You can resize the window if plots appear too small or do not fit.
library(foreign)  # provides read.spss() and read.dta()
mydata <- read.csv("path", sep = ",", header = TRUE)
mydata.spss <- read.spss("path", to.data.frame = TRUE)
mydata.dta <- read.dta("path")
happy <- read.csv("https://raw.githubusercontent.com/kho7/SPDS/master/R/happy.csv")
mydata$column
mydata$Age.rec <- car::recode(mydata$Age, "18:19='18to19'; 20:29='20to29'; 30:39='30to39'")  # recode() with this syntax comes from the car package
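The same kind of recoding can be sketched in base R with cut(), which avoids an extra package. The data frame below is a hypothetical stand-in for mydata, and the example assumes ages are whole numbers.

# Hypothetical stand-in for `mydata`, with integer ages
mydata <- data.frame(Age = c(18, 19, 23, 29, 30, 35, 39))

# Access a single column with the $ operator
mydata$Age

# Bin ages into labelled groups; intervals are (17,19], (19,29], (29,39]
mydata$Age.rec <- cut(mydata$Age,
                      breaks = c(17, 19, 29, 39),
                      labels = c("18to19", "20to29", "30to39"))

table(mydata$Age.rec)   # counts per age group

The result is a factor, so values outside the breaks become NA rather than being silently kept.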
"Beware of bugs in the above code; I have only proved it correct, not tried it."
- Donald Knuth, author of The Art of Computer Programming
Source: https://www.frontiersofknowledgeawards-fbbva.es/version/edition_2010/
YAML (Yet Another Markup Language, or YAML Ain't Markup Language) is a data-oriented, human-readable language mostly used for configuration files.
Undocumented, with little or no information on sampling
Link to RStudio Cloud:
https://posit.cloud/content/6625059
- Need a GitHub and RStudio Account
Link to class GitHub:
Link to RStudio Cloud:
https://rstudio.cloud/project/4631380
- Need a GitHub and RStudio Account
Link to class GitHub: