FALL 2021
calendar
[Calendar grid, Aug-Dec: first day of class Aug 12; midterm Oct 07; final Dec 02; weekly topics cover CH 1 through CH 7 (CH 1 x2, CH 4 x2, CH 3, CH 2, CH 5 x2, CH 6, CH 7 x3), plus two no-class dates]
class 2307588
class 2307589
Time Series Prediction
Well Log Analysis
Data Clustering in GIS
Microscope/fossils
Seismic
Rock Classification
RULES:
always be on time
behave as you would in an in-person class
if you are absent more than 2 times, you will lose the bonus score
camera always on (a must)... I want to see that you are not sleeping during class
no phones during class
please stay at home or in a proper study environment... not a coffee shop, beach, or restaurant... keep in mind that you are responsible for answering your classmates' and instructor's questions
dress properly
the network equations that humans can program
a computer that knows how to learn
The figures show intervals containing the sequence stratigraphy of a fluvial environment exposed at the surface. Geoscientists can interpret the rocks based on colors, grain sizes, and minerals. So how can we obtain subsurface information? To obtain it, geoscientists deploy wireline tools into a borehole to measure formation properties. Each rock layer contains different kinds of minerals with distinguishable characteristics. However, their properties are not discrete, so classifying them requires tremendous work. Here, we will follow the 2016 SEG competition to study an ML/AutoML workflow and create an automated framework for petrophysical analysis.
>>> a = 10
>>> a
10
The value 10 is assigned to variable a via the "=" assignment operator, and variable a then stores the value 10 in a memory space in the computer. When evaluating variable a, the interpreter returns the value 10.
>>> a = a + 2
>>> a
>>>
Let us try to modify the variable that previously stored the value 10 in memory, this time adding a new value (2) to the previous variable a. What is the outcome? The right-hand side of this code snippet is the input expression, and its result is stored in the variable a named on the left-hand side.
>>> a = a + 4
>>> b = a + 3
>>> a
>>> b
Think about the result and try to explain what values variables a and b now store.
Did you know? Computers use only the numbers 0 and 1 to compute.
data structure: string, tuple, set, dictionary, numpy array, list, matrix, vector, class and object
numeric precision: float, integer, complex, bit (byte)
A > 80
70 < B <= 80
60 < C <= 70
50 < D <= 60
F <= 50
score = 67
if score > 80:
    print('your grade is A')

score = [45, 78, 7, 66, 89]
for i in score:
    if i > 80:
        print('your score is A')
Create a for loop with conditional statements to compute the average score of this data.
score = [34, 55, 67, 78, 99, 100, 45, 35.9, 88.45, 89]
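A minimal sketch of this exercise, using the score list above with a for loop and a running sum:

```python
# Compute the average score with a for loop and an accumulator.
score = [34, 55, 67, 78, 99, 100, 45, 35.9, 88.45, 89]

total = 0.0
count = 0
for s in score:
    total += s   # accumulate the sum
    count += 1   # count the elements
average = total / count
print('average score =', average)
```

The same result can be checked with the built-ins `sum(score) / len(score)`.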
Try to plot cosine and sine functions
To access elements in Python, we need to understand how data positions are referenced (indexing) and the direction used to reach the data (row-column-major order). Indexing in Python refers to the position in the memory allocation; by default, Python counts data from "0", increasing by one. To optimize the speed of accessing elements, we should traverse the data in row-column-major order.
[Diagrams: indexing and row-column-major order for 2-D arrays (row i, col j) and for 3-D arrays (row i, col j, depth k)]
import numpy as np
x = np.array([0, 10, 34, 2, 5, 66, 31, 9])  # note: the () of the call takes a list [] of values
import numpy as np
x = np.array([0, 10, 34, 2, 5, 66, 31, 9])
print(x[0])
print(x[-1])
print(x[3:6])
print(x[-1:-4])  # empty array: going backwards needs a negative step, e.g. x[-1:-4:-1]

Let us try to access the data in a 2D array.
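A short sketch of 2-D access, extending the 1-D slicing above (the array values here are illustrative, not from the dataset):

```python
import numpy as np

# A small 2-D array to practice row/column indexing.
x2d = np.array([[0, 10, 34],
                [2,  5, 66],
                [31, 9,  7]])

print(x2d[0, 0])      # first row, first column
print(x2d[-1, -1])    # last row, last column
print(x2d[1, :])      # the whole second row
print(x2d[:, 2])      # the whole third column
print(x2d[0:2, 1:3])  # a 2x2 sub-block (rows 0-1, columns 1-2)
```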
Try to reproduce the two figures below. Click the GitHub icon to access the guideline on how to reproduce both models.
| Facies | Formation | Well Name | Depth | GR | ILD_log10 | DeltaPHI | PHIND | PE | NM_M | RELPOS |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | A1 SH | SHRIMPLIN | 2793 | 77.45 | 0.664 | 9.9 | 11.915 | 4.6 | 1 | 1 |
| 3 | A1 SH | SHRIMPLIN | 2793.5 | 78.26 | 0.661 | 14.2 | 12.565 | 4.1 | 1 | 0.979 |
| 3 | A1 SH | SHRIMPLIN | 2794 | 79.05 | 0.658 | 14.8 | 13.05 | 3.6 | 1 | 0.957 |
| 3 | A1 SH | SHRIMPLIN | 2794.5 | 86.1 | 0.655 | 13.9 | 13.115 | 3.5 | 1 | 0.936 |
| 3 | A1 SH | SHRIMPLIN | 2795 | 74.58 | 0.647 | 13.5 | 13.3 | 3.4 | 1 | 0.915 |
| 3 | A1 SH | SHRIMPLIN | 2795.5 | 73.97 | 0.636 | 14 | 13.385 | 3.6 | 1 | 0.894 |
| 3 | A1 SH | SHRIMPLIN | 2796 | 73.72 | 0.63 | 15.6 | 13.93 | 3.7 | 1 | 0.872 |
| 3 | A1 SH | SHRIMPLIN | 2796.5 | 75.65 | 0.625 | 16.5 | 13.92 | 3.5 | 1 | 0.83 |
| 3 | A1 SH | SHRIMPLIN | 2797 | 73.79 | 0.624 | 16.2 | 13.98 | 3.4 | 1 | 0.809 |
In case we would like to work with tabular data, note that we can create it in Python from a dictionary.
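For example, a small table built from a dictionary with pandas; the column names follow the well-log table above, with only the first three rows for brevity:

```python
import pandas as pd

# Build tabular data from a dictionary: keys become column names,
# values become the column contents.
data = {
    'Facies':    [3, 3, 3],
    'Well Name': ['SHRIMPLIN', 'SHRIMPLIN', 'SHRIMPLIN'],
    'Depth':     [2793.0, 2793.5, 2794.0],
    'GR':        [77.45, 78.26, 79.05],
}
df = pd.DataFrame(data)
print(df)
print(df.shape)  # 3 rows, 4 columns
```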
Python provides basic built-in functions that we can use for computing sine and cosine. Here, we are going to re-create the 3 functions and try to solve the problem with more than one solution. See GitHub for the guideline.
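One possible pair of solutions, as a sketch: the built-in math module evaluated in a loop, and NumPy evaluated on the whole array at once (the grid of 101 points is an assumption for illustration):

```python
import math
import numpy as np

# Solution 1: the math module, one value at a time.
xs = [i * 2 * math.pi / 100 for i in range(101)]
sin_loop = [math.sin(v) for v in xs]

# Solution 2: NumPy, vectorized over the whole array.
x = np.linspace(0, 2 * np.pi, 101)
sin_vec = np.sin(x)
cos_vec = np.cos(x)

# Both approaches agree to floating-point precision.
print(max(abs(a - b) for a, b in zip(sin_loop, sin_vec)))
# To plot: import matplotlib.pyplot as plt; plt.plot(x, sin_vec); plt.plot(x, cos_vec)
```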
Let us reproduce all figures
[Figure: a 2-D image stored as a matrix of amplitudes (values such as 0.02, 0.07, 0.27, 0.55); the pixel axes run from 0 up to 840 and 918, labeled pixel-y and amplitude]
Let us crop one oxbow from the 2D image, and then extract only the waterbody.
Hint (1): use an if statement to suppress the unwanted areas. Once you have extracted the waterbody successfully, you can count the number of pixels to estimate the size of the waterbody. Hint (2): use a for loop to count the values in the 2D matrix that are greater than or equal to your waterbody value.
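A sketch of the two hints on a toy 2-D "image" (the pixel values and the threshold of 200 are made up; the real exercise uses the cropped oxbow):

```python
import numpy as np

# Toy image: high values stand in for water, low values for land.
image = np.array([[10, 200, 210, 12],
                  [15, 220, 205, 18],
                  [11,  14, 215, 16]])
water_threshold = 200  # assumed cutoff separating water from land

count = 0
rows, cols = image.shape
for i in range(rows):
    for j in range(cols):
        if image[i, j] >= water_threshold:
            count += 1          # hint (2): count waterbody pixels
        else:
            image[i, j] = 0     # hint (1): suppress unwanted (land) areas

print('waterbody pixels =', count)
```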
Flattening is one of the most common techniques in ML, with many applications, such as speeding up computation and collapsing a high-dimensional space into a vector for computation.
Show me how you flatten the 2D array
import numpy as np

x = np.array([1, 2, 3, 4, 12, 4.6, -2])
y = np.array([5, 8, 9, -1, -8.3, 2.6, 0])
dummy_of_2D = np.zeros(shape=(len(y), len(x)), dtype=float)
for col in range(0, len(x)):
    print('second loop x =', x[col])
    for row in range(0, len(y)):
        # print('first loop x =', x[col], 'first loop y =', y[row])
        dummy_of_2D[row, col] += x[col] + y[row]
print(dummy_of_2D.shape)
print(dummy_of_2D)

[Figure: a 2-D image matrix (pixel values such as 255, 208, 127, 106, 8, 2) flattened into a 1-D vector]
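A short sketch of the flattening exercise, showing three equivalent NumPy ways to turn a 2-D array into a 1-D vector:

```python
import numpy as np

x2d = np.array([[1, 2, 3],
                [4, 5, 6]])

flat_a = x2d.flatten()    # always returns a copy
flat_b = x2d.ravel()      # returns a view when possible (cheaper)
flat_c = x2d.reshape(-1)  # reshape to one dimension

# Row-major order: row 0 comes out first, then row 1.
print(flat_a)
```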
[Figure: conditional selection with boolean expressions such as x < 4.2 and A > 80, illustrated on example number lines (0-8 and 2-10)]
data types:
- numeric data: integer (e.g. Facies: 3, NM_M: 1), floating-point (e.g. Depth: 2793.0, GR: 78.26)
- categorical data: binary (e.g. Facies: 2 classes), multiclass (e.g. Facies: 9 classes)
- text: string (e.g. Well Name: SHRIMPLIN)
Tabular data provide a lot of information across multiple rows and columns, and these big data might need to be decomposed into scatter plots to gain more understanding of the data. Here, we analyze the relationship between ILD and GR. Note that one log penetrates multiple layers of rock, shown in the left diagram as multiple colors.
Let us try to reproduce the figure. Note that if you would like to subsample the data, for example keeping only every 10th data point, you can use "data = data[::10]".
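The slicing trick can be sketched like this (the array here is a stand-in for the (GR, ILD_log10) pairs, not the real dataset):

```python
import numpy as np

# Stand-in for the crossplot data: 50 rows of (GR, ILD_log10)-like pairs.
data = np.arange(100).reshape(50, 2)

sampled = data[::10]       # keep every 10th data point
print(sampled.shape)       # 50 rows reduced to 5
# To draw the crossplot: plt.scatter(sampled[:, 0], sampled[:, 1])
```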
[Flowchart: input data → create a line to fit → measure the error → best fit found? (no: repeat, yes: stop)]
k-means: find the shortest distance between each data point and each centroid; at the yes/no condition, the program ends when the shortest distances between centroids and data points reach the threshold (near 0).
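The k-means loop above can be sketched in a few lines of NumPy. The data, the two initial centroid guesses, and the threshold are all assumptions for illustration:

```python
import numpy as np

# Two synthetic clusters, one near (0, 0) and one near (5, 5).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                  rng.normal(5.0, 0.5, (20, 2))])

k = 2
centroids = np.array([[1.0, 1.0], [4.0, 4.0]])  # simple initial guesses
for _ in range(100):
    # shortest path: Euclidean distance from every point to every centroid
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # assign each point to its nearest centroid
    new_centroids = np.array([data[labels == c].mean(axis=0) for c in range(k)])
    if np.linalg.norm(new_centroids - centroids) < 1e-6:  # threshold near 0
        break                       # centroids stopped moving: done
    centroids = new_centroids

print(centroids)
```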
[Decision diagram: classify by grain size (< 0.1 vs >= 0.1), color (white-like vs dark-like), mineral density (>= 2.60 vs < 2.60), and hardness (Mohs = 1, 7, 10) to separate quartz, feldspar, granite, and sandstone]
Keep in mind that we cannot classify granite and sandstone with rules as simple as these.
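Such rules can be written as nested if statements. The branching below is one hypothetical arrangement of the attributes in the diagram; the slide's exact tree may differ, and, as noted, real rocks cannot be classified this simply:

```python
# Hypothetical rule-based classifier using the diagram's attributes.
def classify_rock(grain_size_mm, density, mohs):
    if grain_size_mm >= 0.1:      # coarse grains: whole rocks in this sketch
        if density >= 2.60:
            return 'granite'
        return 'sandstone'
    else:                         # fine grains: single minerals in this sketch
        if mohs == 7:
            return 'quartz'
        return 'feldspar'

print(classify_rock(0.5, 2.7, 6))
print(classify_rock(0.05, 2.65, 7))
```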
Preview of a result from this exercise.
missing values
normalization
outliers
categorical transformation
| Facies | Formation | Well Name | Depth | GR | ILD_log10 | DeltaPHI | PHIND | PE | NM_M | RELPOS |
|---|---|---|---|---|---|---|---|---|---|---|
| sand | A1 SH | SHRIMPLIN | 2793 | 77.45 | 0.664 | 9.9 | 11.915 | 4.6 | 1 | 1 |
| shale | A1 SH | SHRIMPLIN | 2793.5 | 78.26 | 0.661 | 14.2 | 12.565 | 4.1 | 1 | 0.979 |
| dolomite | A1 SH | SHRIMPLIN | 2794 | 79.05 | 0.658 | 14.8 | 13.05 | 3.6 | 1 | 0.957 |
| sand | A1 SH | SHRIMPLIN | 2794.5 | 86.1 | 0.655 | 13.9 | 100 | 3.5 | 1 | 0.936 |
| shale | A1 SH | SHRIMPLIN | 2795 | 74.58 | 0.647 | | 13.3 | 3.4 | 1 | 0.915 |
| limestone | A1 SH | SHRIMPLIN | 2795.5 | 73.97 | 0.636 | 14 | 13.385 | 3.6 | 1 | 0.894 |
| dolomite | A1 SH | SHRIMPLIN | 2796 | 73.72 | 0.63 | 15.6 | 13.93 | 3.7 | 1 | 0.872 |
| sand | A1 SH | SHRIMPLIN | 2796.5 | 75.65 | 0.625 | 16.5 | 13.92 | 3.5 | 1 | 0.83 |
| sand | A1 SH | SHRIMPLIN | 2797 | 73.79 | 0.624 | 16.2 | 13.98 | 3.4 | 1 | 0.809 |
Missing-value strategies: flag the gap with -999, or fill it from the neighbors, e.g. DeltaPHI = (13.9+14)/2
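Both strategies in a short pandas sketch, using the DeltaPHI column with the gap from the table above:

```python
import numpy as np
import pandas as pd

# DeltaPHI values from the table; the 5th entry is the missing one.
delta_phi = pd.Series([9.9, 14.2, 14.8, 13.9, np.nan, 14.0, 15.6, 16.5, 16.2])

flagged = delta_phi.fillna(-999)   # strategy 1: flag the gap with -999
filled = delta_phi.interpolate()   # strategy 2: average of the neighbors
print(filled[4])                   # (13.9 + 14.0) / 2 = 13.95
```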
| Facies | categorical transformation |
|---|---|
| sand | 1 |
| shale | 2 |
| dolomite | 3 |
| sand | 1 |
| shale | 2 |
| limestone | 4 |
| dolomite | 3 |
| sand | 1 |
| sand | 1 |
[Diagram: min-max normalization to the range [-1, 1]; example column minima and maxima: GR min = 73.72, max = 86.1; ILD_log10 min = 0.624, max = 0.664; what do GR = 75.65 and ILD_log10 = 0.647 map to? (???)]
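The mapping can be sketched with the usual min-max formula, scaled onto [-1, 1], using the GR numbers from the diagram:

```python
# Map a value x from [lo, hi] linearly onto [a, b].
def minmax_scale(x, lo, hi, a=-1.0, b=1.0):
    return a + (x - lo) * (b - a) / (hi - lo)

# GR column: min = 73.72, max = 86.1; scale the query value 75.65.
gr_scaled = minmax_scale(75.65, 73.72, 86.1)
print(gr_scaled)
```

The endpoints map to the range limits: `minmax_scale(73.72, 73.72, 86.1)` gives -1.0 and `minmax_scale(86.1, 73.72, 86.1)` gives 1.0.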
step 1: import data
step 2: preprocessing data
step 3: split data
step 4: decision tree
step 5: evaluate model
| PE | miss_value |
|---|---|
| 4.6 | 0 |
| 4.1 | 0 |
| 3.6 | 0 |
| -999 | 1 |
| -999 | 1 |
| -999 | 1 |
| -999 | 1 |
| -999 | 1 |
| -999 | 1 |
Try to compare the results with the added missing-value indicator column against those without the added column.
This time, change the train/test split ratio and the max depth.
Note that this demo preprocesses only the missing values, by adding one indicator column. Future work can add dropping outliers, normalization, etc.
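Steps 1-5 above can be sketched with scikit-learn on synthetic stand-in data (the real exercise loads the well-log table; the data, split ratio, and max_depth here are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# step 1: "import" data (synthetic stand-in: 200 rows, 3 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels from a simple rule

# step 2: preprocessing would go here (missing values, etc.)

# step 3: split data (80/20 ratio -- try changing it)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# step 4: decision tree (try changing max_depth, as the exercise suggests)
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# step 5: evaluate model
acc = accuracy_score(y_test, model.predict(X_test))
print('accuracy =', acc)
```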
step 1: import data
step 2: preprocessing data
step 3: split data
step 4: algorithm selection with optimized parameters (replacing the single decision tree)
step 5: evaluate model
continued
lithofacies
well log (Newby)
pairplots with normal distributions
Facies description labels with adjacent facies
| Facies | Description | Adjacent facies |
|---|---|---|
| 1 | Nonmarine sandstone | 2 |
| 2 | Nonmarine coarse siltstone | 1,3 |
| 3 | Nonmarine fine siltstone | 2 |
| 4 | Marine siltstone and shale | 5 |
| 5 | Mudstone | 4,6 |
| 6 | Wackestone | 5,7,8 |
| 7 | Dolomite | 6,8 |
| 8 | Packstone-grainstone | 6,7,9 |
| 9 | Phylloid-algal bafflestone | 7,8 |
non-marine
marine
transition
input data
continued
raw data
preprocessed data
input data
continued
feature engineering
split data by wells
split data by fraction (80%)
NEWBY
LUKE G U
CROSS H CATTLE
SHANKLE
SHRIMPLIN
Recruit F9
NOLAN
ALEXANDER D
CHURCHMAN BIBLE
KIMZEY A
split data by fraction (20%)
training data (80%)
validating data (20%)
models in AutoML
input data
continued
feature engineering
AutoML
input data
feature engineering
AutoML
evaluation metrics
Example: we are looking for monkeys vs. non-monkeys (a binary classification).
[Confusion matrix and per-class precision, recall, and F-1 score for the classes hydrocarbon, brine, and unsaturation]
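Per-class precision, recall, and F-1 score can be computed by hand from true/false positives and false negatives. The labels below are made up for the three classes:

```python
import numpy as np

classes = ['hydrocarbon', 'brine', 'unsaturation']
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

for c, name in enumerate(classes):
    tp = np.sum((y_pred == c) & (y_true == c))  # true positives
    fp = np.sum((y_pred == c) & (y_true != c))  # false positives
    fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
    precision = tp / (tp + fp)                  # of the predicted c, how many are right
    recall = tp / (tp + fn)                     # of the true c, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    print(f'{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```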
gradient boosting methods
To improve the efficiency of gradient boosting methods, we should convert all columns containing categorical data into numeric data. Label encoding and one-hot encoding address this problem. Label encoding converts categorical data into sequential numeric values. However, this method might implant a biased weight onto the larger values. To overcome this, we use one-hot encoding, which spreads the label-encoded column into multiple columns, each containing only binary data.
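Both encodings in a short pandas sketch, on the Facies labels from the table above:

```python
import pandas as pd

df = pd.DataFrame({'Facies': ['sand', 'shale', 'dolomite', 'sand', 'limestone']})

# Label encoding: one integer per category (implies an artificial order).
df['Facies_label'] = df['Facies'].astype('category').cat.codes

# One-hot encoding: one binary column per category (no implied order).
one_hot = pd.get_dummies(df['Facies'], prefix='Facies')
print(one_hot)
```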
The synthetic data vary along only one direction, which might be noise rather than data.
To answer the question of which part of the dataset should be the training and testing data, we might shuffle the data and randomly split the dataset into training and testing sets. Moreover, which parameters are best for this dataset? These two questions led data scientists to create cross-validation (CV) techniques.
Which splits are best for the assigned parameters? It depends, right? In practice, we should avoid results that depend on one particular split.
hold-out: the ratio of train and test data is fixed
k-fold: split a dataset into a number of groups (folds), so-called k, and then evaluate the model in each iteration
shuffle-split: similar to k-fold, except the size of the test data can be varied
all three CV types can be combined with the stratified technique, in which the data are arranged so that every test fold contains all labels before evaluation
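The k-fold idea can be sketched by hand, without a library: shuffle the indices, cut them into k groups, and use each group once as the test fold (the data size, k, and seed below are illustrative):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)   # shuffle the data order
    folds = np.array_split(idx, k)     # cut into k nearly equal folds
    for i in range(k):
        test = folds[i]                # fold i is the test set this iteration
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

for train, test in k_fold_indices(10, 5):
    print('train:', sorted(train), 'test:', sorted(test))
```

Each sample appears in exactly one test fold, so every data point is evaluated exactly once across the k iterations.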
input data
continued
normalization
outliers
missing value
encoding
input data
continued
feature engineering
CatBoost
default parameters
early stopping
over-fitting detection
k-fold CV
The final presentation will be held on Jan 03, 2022, starting 9:00-12:00. Students must turn on their cameras and give a 12-minute talk followed by a 5-minute discussion.
Submit your final presentation by Jan 02 (11:00 p.m.).
[Figure: grid search over parameter a and parameter b]