In [ ]:
! pip install transformers==4.15.0 sentence-transformers datasets
In [ ]:
from datasets import load_dataset
data = load_dataset('emotion', split='train')
Using custom data configuration default
Reusing dataset emotion (/root/.cache/huggingface/datasets/emotion/default/0.0.0/348f63ca8e27b3713b6c04d723efe6d824a56fb3d1449794716c0f0296072705)
In [ ]:
data["label"]
Out[ ]:
[0, 0, 3, 2, 3, 0, 5, 4, 1, 2, 0, 1, 3, 0, 1, 1, 0, 0, 0, 4, 3, 4, 1, 1, 3, 0, 0, 0, 3, 1, 1, 4, 5, 3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 0, 0, 1, 2, 1, 3, 1, 0, 3, 4, 1, 0, 0, 5, 1, 1, 1, 2, 4, 4, 5, 3, 3, 0, 2, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 3, 0, 3, 3, 3, 1, 1, 1, 1, 0, 4, 2, 3, 0, 3, 2, 0, 1, 1, 0, 3, 2, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 2, 0, 0, 1, 1, 0, 1, 1, 4, 4, 4, 0, 2, 1, 1, 2, 4, 5, 1, 1, 1, 1, 3, 4, 1, 3, 2, 3, 0, 1, 0, 3, 1, 5, 0, 3, 3, 0, 1, 4, 1, 1, 4, 0, 5, 5, 1, 3, 4, 3, 0, 3, 0, 4, 0, 1, 5, 4, 1, 3, 1, 3, 1, 4, 4, 0, 1, 1, 0, 5, 1, 4, 1, 0, 1, 1, 1, 4, 1, 5, 1, 3, 0, 0, 1, 3, 0, 1, 1, 5, 1, 4, 1, 4, 0, 4, 2, 0, 4, 2, 0, 0, 3, 1, 2, 3, 0, 5, 3, 1, 0, 3, 1, 1, 1, 1, 1, 0, 1, 2, 1, 0, 3, 5, 1, 3, 1, 2, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 4, 0, 3, 0, 3, 2, 1, 2, 0, 1, 1, 1, 0, 1, 0, 3, 2, 0, 2, 0, 0, 0, 0, 0, 0, 4, 1, 0, 0, 1, 2, 0, 3, 0, 2, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 2, 4, 0, 4, 1, 1, 4, 1, 3, 3, 2, 0, 5, 1, 3, 0, 0, 3, 2, 5, 0, 2, 1, 3, 1, 0, 0, 1, 1, 4, 0, 3, 1, 2, 1, 1, 1, 1, 1, 2, 1, 4, 4, 1, 1, 3, 1, 1, 1, 0, 5, 1, 1, 0, 5, 1, 4, 1, 3, 2, 3, 4, 5, 1, 3, 0, 0, 0, 4, 1, 4, 1, 3, 0, 2, 3, 0, 2, 2, 1, 0, 1, 1, 3, 1, 2, 1, 1, 0, 5, 1, 1, 0, 5, 3, 1, 1, 2, 1, 2, 2, 1, 0, 1, 0, 4, 3, 0, 3, 1, 4, 5, 2, 0, 0, 1, 5, 2, 0, 0, 1, 4, 1, 1, 1, 2, 4, 0, 0, 4, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 3, 0, 4, 0, 0, 3, 5, 5, 1, 5, 4, 2, 3, 1, 4, 0, 3, 0, 1, 4, 1, 1, 1, 1, 5, 3, 0, 1, 0, 3, 2, 0, 2, 3, 2, 2, 1, 1, 0, 1, 0, 1, 3, 1, 4, 0, 1, 0, 5, 1, 1, 2, 1, 0, 3, 1, 0, 0, 0, 1, 1, 1, 0, 0, 3, 3, 1, 0, 5, 1, 1, 0, 3, 1, 3, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4, 0, 3, 3, 0, 2, 0, 0, 0, 5, 4, 0, 0, 1, 4, 1, 0, 0, 3, 0, 3, 3, 2, 1, 1, 0, 1, 1, 1, 2, 2, 2, 1, 5, 4, 0, 3, 0, 4, 1, 3, 1, 1, 0, 1, 4, 1, 3, 0, 3, 1, 2, 2, 0, 4, 0, 0, 0, 4, 0, 3, 1, 1, 0, 4, 4, 3, 3, 0, 0, 3, 3, 1, 4, 0, 0, 4, 1, 4, 0, 0, 1, 0, 3, 0, 2, 5, 1, 1, 1, 1, 1, 0, 0, 2, 4, 3, 1, 2, 1, 1, 4, 3, 4, 0, 3, 3, 3, 4, 1, 2, 3, 4, 3, 1, 4, 0, 0, 3, 1, 3, 4, 4, 5, 0, 1, 1, 1, 2, 0, 1, 0, 3, 1, 1, 4, 4, 1, 4, 3, 
3, 3, 1, 0, 1, 1, 1, 3, 4, 3, 1, 0, 1, 4, 1, 0, 3, 0, 0, 1, 1, 0, 2, 1, 0, 3, 2, 0, 1, 1, 4, 3, 1, 1, 2, 5, 4, 4, 3, 3, 2, 0, 0, 1, 2, 3, 1, 3, 1, 1, 1, 4, 4, 2, 0, 2, 1, 1, 0, 0, 3, 3, 3, 3, 0, 1, 0, 1, 4, 0, 1, 0, 1, 5, 4, 1, 1, 4, 4, 0, 0, 3, 0, 2, 1, 5, 1, 5, 1, 4, 1, 1, 1, 1, 5, 2, 0, 2, 1, 0, 1, 1, 0, 1, 1, 1, 1, 4, 4, 4, 4, 3, 2, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 4, 1, 3, 1, 1, 4, 3, 1, 3, 3, 3, 3, 3, 1, 4, 0, 2, 0, 4, 2, 1, 5, 1, 1, 1, 1, 0, 5, 1, 3, 3, 3, 1, 1, 0, 2, 0, 1, 3, 3, 3, 4, 1, 1, 4, 0, 5, 1, 2, 0, 1, 1, 0, 0, 0, 3, 1, 0, 0, 0, 0, 1, 5, 1, 3, 2, 1, 3, 0, 1, 3, 0, 4, 4, 0, 0, 4, 0, 2, 4, 1, 1, 4, 1, 3, 1, 1, 3, 1, 1, 2, 3, 1, 1, 1, 3, 1, 2, 4, 3, 2, 0, 4, 0, 0, 2, 1, 4, 2, 1, 1, 0, 1, 0, 4, 1, 0, 2, 0, 2, 0, 0, 3, 2, 3, 1, 3, 0, 0, 1, 1, 3, 0, 5, 1, 2, 0, 3, 4, 2, 1, 1, 3, 1, 0, 1, 1, 1, 2, 1, 0, 2, 4, 0, 1, 0, 3, 1, 1, 0, 1, 2, 1, 1, 1, 3, 4, 0, 1, 0, 0, 2, 3, 1, 2, 0, 1, 4, 0, 1, 1, 0, 0, 2, 1, 0, 0, 0, 0, 3, 0, 1, 0, 1, 0, 2, 3, 1, 4, 0, 0, 1, 1, 3, 4, 2, 1, 0, 3, 0, 1, ...]
In [ ]:
data.set_format("pandas")
In [ ]:
df = data[:]
df.head()
Out[ ]:
| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |
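The integer labels are easier to read with their emotion names attached. The following is a minimal sketch, assuming the class order given on the dataset card for `emotion` (sadness, joy, love, anger, fear, surprise). `LABEL_NAMES` and `label_name` are hypothetical names introduced here, and the order should be checked against `data.features["label"].names` before relying on it.

```python
# Assumed class order for the `emotion` dataset; verify against
# data.features["label"].names in your own environment.
LABEL_NAMES = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def label_name(label_id):
    """Return the emotion name for an integer class id."""
    return LABEL_NAMES[int(label_id)]


# The five labels shown by df.head() above.
head_labels = [0, 0, 3, 2, 3]
head_names = [label_name(i) for i in head_labels]
print(head_names)
```

With the DataFrame from the notebook, `df["label"].map(label_name)` would produce the same names as a column.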
Sentence Transformer Embeddings
In [ ]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import numpy as np
In [ ]:
def embedding_generator(df, model_name):
    # Register tqdm's progress_apply on pandas objects.
    tqdm.pandas()
    # Load the pretrained sentence-transformers model by name.
    model = SentenceTransformer(model_name)
    # Encode each text into a fixed-size vector (one model call per row).
    df['Embeddings'] = df['text'].progress_apply(lambda x: model.encode(x))
    return df
In [ ]:
def inputgen(df):
    # Stack the per-row embedding vectors into one (n_rows, dim) array.
    a = []
    for i in tqdm(range(len(df))):
        # iloc is positional, so this also works when the index is not 0..n-1.
        a.append(np.array(df["Embeddings"].iloc[i]))
    a = np.array(a)
    print(a.shape)
    return a
In [ ]:
# Keep the first 2,000 rows; .copy() avoids pandas' SettingWithCopyWarning when columns are added later.
df = df[0:2000].copy()
In [ ]:
model1 = "sentence-transformers/all-MiniLM-L6-v2"
inputgen_input = embedding_generator(df, model1)
print("Model 1 embedding completed")
embed_output = inputgen(inputgen_input)
print("Model 1 array completed")
100%|██████████| 2000/2000 [00:13<00:00, 149.65it/s]
Model 1 embedding completed
100%|██████████| 2000/2000 [00:00<00:00, 69706.98it/s]
(2000, 384)
Model 1 array completed
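`SentenceTransformer.encode` also accepts a list of sentences and batches them internally, which is usually faster than one call per row and returns the stacked array in a single step. The following is a sketch under that assumption. `embed_texts` and `load_minilm` are hypothetical helpers, not part of the notebook, and `batch_size=64` is an arbitrary choice.

```python
def embed_texts(texts, model, batch_size=64):
    """Encode all texts in one call.

    `model` is any object with a sentence-transformers style
    encode(sentences, batch_size=..., show_progress_bar=...) method.
    With a real SentenceTransformer the result is a NumPy array of
    shape (len(texts), embedding_dim).
    """
    return model.encode(list(texts), batch_size=batch_size,
                        show_progress_bar=False)


def load_minilm():
    """Load the model used in this notebook (downloads weights on first use)."""
    # Imported lazily so that defining these helpers does not require the package.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


# Hypothetical usage with the DataFrame from the notebook:
# embed_output = embed_texts(df["text"], load_minilm())
# print(embed_output.shape)
```

This would stand in for both `embedding_generator` and `inputgen` when only the array is needed; the per-row version is still convenient when the vectors should live in a DataFrame column.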
In [ ]:
embed = pd.DataFrame(embed_output)
df = pd.concat([df, embed], axis=1)
df
Out[ ]:
| | text | label | Embeddings | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | [-0.055050932, -0.007696963, 0.06353026, -0.03... | -0.055051 | -0.007697 | 0.063530 | -0.039664 | 0.116901 | -0.123296 | 0.058080 | ... | 0.063319 | -0.044138 | -0.034640 | 0.021249 | -0.029084 | 0.084679 | 0.016152 | 0.015425 | -0.135161 | -0.064534 |
| 1 | i can go from feeling so hopeless to so damned... | 0 | [0.009238882, -0.05296433, 0.01926256, 0.03402... | 0.009239 | -0.052964 | 0.019263 | 0.034021 | 0.125202 | 0.027428 | 0.077058 | ... | -0.016320 | -0.024402 | -0.044897 | 0.132352 | -0.082222 | 0.003469 | 0.095559 | -0.060182 | -0.027176 | -0.026275 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | [-0.074502885, -0.010641884, -0.0034595674, -0... | -0.074503 | -0.010642 | -0.003460 | -0.073246 | -0.018509 | -0.026024 | 0.023559 | ... | 0.050347 | -0.030673 | -0.001018 | 0.019752 | 0.078385 | -0.010269 | 0.041514 | -0.024779 | -0.042020 | 0.024512 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | [0.10859442, 0.09532223, 0.03647684, 0.0151784... | 0.108594 | 0.095322 | 0.036477 | 0.015178 | 0.089073 | -0.012647 | -0.089686 | ... | 0.019334 | -0.076964 | -0.004122 | 0.023587 | 0.056529 | 0.024166 | 0.103731 | -0.044091 | -0.109329 | 0.034851 |
| 4 | i am feeling grouchy | 3 | [-0.016712204, -0.07877089, 0.032170087, -0.05... | -0.016712 | -0.078771 | 0.032170 | -0.053829 | 0.115593 | -0.051190 | 0.132093 | ... | -0.011990 | 0.003192 | -0.077645 | -0.016146 | 0.007182 | 0.029738 | 0.059137 | -0.062703 | -0.019559 | -0.057704 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | i feel so low and i havent felt this low in a ... | 0 | [0.00746265, -0.072956935, 0.06638428, 0.05247... | 0.007463 | -0.072957 | 0.066384 | 0.052478 | 0.011782 | -0.002765 | 0.015952 | ... | -0.046977 | -0.010428 | -0.048182 | 0.081015 | -0.056379 | -0.043684 | 0.059703 | -0.061770 | -0.076507 | -0.000722 |
| 1996 | i absolutely love this skinny fiber it is doin... | 1 | [-0.05117633, -0.04654245, 0.042185836, 0.0601... | -0.051176 | -0.046542 | 0.042186 | 0.060135 | 0.005652 | -0.068989 | 0.031247 | ... | -0.010567 | -0.019804 | -0.038921 | 0.050018 | -0.011048 | 0.008017 | 0.041045 | -0.010521 | 0.005083 | -0.002920 |
| 1997 | i feel as if im in some strange catholic vortex | 5 | [0.0092844125, -0.094246, 0.026456287, -0.0027... | 0.009284 | -0.094246 | 0.026456 | -0.002791 | 0.047796 | -0.070760 | 0.011053 | ... | 0.087301 | -0.005810 | 0.040688 | 0.020890 | -0.022804 | -0.063676 | 0.061230 | -0.020718 | -0.048927 | -0.092958 |
| 1998 | i have a feeling that many of you will be surp... | 5 | [0.11245156, -0.03292613, 0.11352201, -0.03423... | 0.112452 | -0.032926 | 0.113522 | -0.034240 | 0.062835 | 0.008078 | 0.031408 | ... | 0.096730 | 0.014149 | -0.011893 | 0.077310 | -0.106346 | 0.016515 | 0.078990 | 0.053615 | -0.086291 | -0.006568 |
| 1999 | i am so connected with families that are not m... | 1 | [-0.03639206, -0.10061437, -0.0018109059, 0.02... | -0.036392 | -0.100614 | -0.001811 | 0.025786 | -0.028809 | -0.010301 | 0.012143 | ... | 0.034679 | -0.006586 | 0.019008 | 0.090166 | 0.052975 | 0.014769 | 0.087924 | 0.023971 | -0.021222 | -0.063778 |
2000 rows × 387 columns
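Each row now carries a 384-dimensional vector, and such vectors are typically compared by cosine similarity, the same metric this notebook passes to UMAP. The following is a minimal sketch in plain Python. `cosine_similarity` is a hypothetical helper, and the two toy vectors are only the first three components of rows 0 and 4 from the table above, so the resulting number says nothing about the full embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: first three components of rows 0 and 4 (truncated on purpose).
row0 = [-0.055051, -0.007697, 0.063530]
row4 = [-0.016712, -0.078771, 0.032170]

similarity = cosine_similarity(row0, row4)
print(round(similarity, 3))
```

On the real data the same function could be applied to two rows of `embed_output`, for example `cosine_similarity(embed_output[0], embed_output[4])`.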
Embeddings Dimensionality Reduction
In [ ]:
! pip install umap-learn
In [ ]:
from umap import UMAP
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import numpy as np
In [ ]:
def dimensionality_reduction(embed_arr, label):
    # Scale every embedding dimension to [0, 1] before fitting UMAP.
    X_scaled = MinMaxScaler().fit_transform(embed_arr)
    # Project to 2D using cosine distance between embeddings.
    mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
    df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
    # Assign labels by position so a non-default index on `label` cannot misalign rows.
    df_emb["label"] = np.asarray(label)
    print(df_emb)
    return df_emb
In [ ]:
model1 = "sentence-transformers/all-MiniLM-L6-v2"
inputgen_input = embedding_generator(df, model1)
print("Embedding completed")
embed_output = inputgen(inputgen_input)
print("Array completed")
dim_emb_out = dimensionality_reduction(embed_output, df["label"])
print("UMAP completed")
100%|██████████| 2000/2000 [00:14<00:00, 137.54it/s]
Embedding completed
100%|██████████| 2000/2000 [00:00<00:00, 71517.18it/s]
(2000, 384)
Array completed
             X          Y  label
0     4.410917  10.360144      0
1     7.095046  11.037721      0
2     6.351001   9.905983      3
3     8.679171   9.845695      2
4     5.573699  12.767012      3
...        ...        ...    ...
1995  6.500954  11.825342      0
1996  9.314243  11.284079      1
1997  6.900813   9.577901      5
1998  8.642849   9.790328      5
1999  7.501454   8.454765      1

[2000 rows x 3 columns]
UMAP completed
In [ ]:
dim_emb_out
Out[ ]:
| | X | Y | label |
|---|---|---|---|
| 0 | 4.410917 | 10.360144 | 0 |
| 1 | 7.095046 | 11.037721 | 0 |
| 2 | 6.351001 | 9.905983 | 3 |
| 3 | 8.679171 | 9.845695 | 2 |
| 4 | 5.573699 | 12.767012 | 3 |
| ... | ... | ... | ... |
| 1995 | 6.500954 | 11.825342 | 0 |
| 1996 | 9.314243 | 11.284079 | 1 |
| 1997 | 6.900813 | 9.577901 | 5 |
| 1998 | 8.642849 | 9.790328 | 5 |
| 1999 | 7.501454 | 8.454765 | 1 |
2000 rows × 3 columns
Visualization of Embeddings
In [ ]:
import matplotlib.pyplot as plt
In [ ]:
fig, axes = plt.subplots(2, 3, figsize=(7,5))
axes = axes.flatten()
cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]
# Sort the label ids so each panel's title matches the subset it plots.
labels = sorted(df["label"].unique())
for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    # Select the 2D points for this label (not for the loop index).
    df_emb_sub = dim_emb_out.query(f"label == {label}")
    print(df_emb_sub)
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap,
                   gridsize=10, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([]), axes[i].set_yticks([])
plt.tight_layout()
plt.show()
             X          Y  label
0     4.410917  10.360144      0
1     7.095046  11.037721      0
5     6.884747  12.149796      0
10    5.573589  10.928417      0
13    7.363647  12.668382      0
...        ...        ...    ...
1983  7.338718   7.762202      0
1987  6.602554   9.052503      0
1989  6.382240  10.995190      0
1992  7.367487  11.797047      0
1995  6.500954  11.825342      0

[544 rows x 3 columns]
             X          Y  label
8     7.892040  10.033929      1
11    7.005034   9.483533      1
14    9.480061   8.485753      1
15    6.398749  12.128724      1
22    6.625608   7.409938      1
...        ...        ...    ...
1988  8.608582  11.570669      1
1991  9.189944   9.329336      1
1993  6.085050   9.369851      1
1996  9.314243  11.284079      1
1999  7.501454   8.454765      1

[702 rows x 3 columns]
             X          Y  label
3     8.679171   9.845695      2
9     5.858622   8.135866      2
47    7.230072   8.507003      2
61    5.531404  12.353902      2
68    6.735010   9.334861      2
...        ...        ...    ...
1972  5.786281  10.251306      2
1973  8.701107   9.829437      2
1978  4.189478   8.249639      2
1984  7.403680   9.066971      2
1985  5.714351  12.618199      2

[173 rows x 3 columns]
             X          Y  label
2     6.351001   9.905983      3
4     5.573699  12.767012      3
12    8.151361  10.702692      3
20    4.643733   9.135000      3
24    7.602103  12.617702      3
...        ...        ...    ...
1942  4.633220  11.428572      3
1946  8.804513  12.250484      3
1953  4.024027  10.380529      3
1963  9.191159  11.068044      3
1986  4.087564  11.096752      3

[268 rows x 3 columns]
             X          Y  label
7     5.388288   9.851950      4
19    6.385650  10.144875      4
21    8.048517  11.486029      4
31    6.078864  11.757133      4
53    8.348764   8.977859      4
...        ...        ...    ...
1955  7.125514   7.996155      4
1965  7.390118   9.450252      4
1975  4.810911   9.063455      4
1990  5.460900  11.757572      4
1994  7.669730  12.339661      4

[236 rows x 3 columns]
             X          Y  label
6     6.616123  13.090185      5
32    8.302124  10.460528      5
57    7.081882   8.609643      5
64    7.512016  12.824598      5
129   7.710279   7.798317      5
...        ...        ...    ...
1928  5.311825   7.922052      5
1945  7.558575   7.223899      5
1954  7.007526  11.439436      5
1997  6.900813   9.577901      5
1998  8.642849   9.790328      5

[77 rows x 3 columns]
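The subset sizes printed above show that the 2,000-row sample is imbalanced, which affects how dense each hexbin panel can look. The following small sketch tallies the class shares from those printed row counts. `counts` and `class_shares` are hypothetical names introduced here, and the numbers are copied from the output rather than recomputed.

```python
# Row counts per label id, copied from the printed subsets above.
counts = {0: 544, 1: 702, 2: 173, 3: 268, 4: 236, 5: 77}


def class_shares(counts):
    """Return each label's fraction of the total row count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


total = sum(counts.values())
shares = class_shares(counts)
for label, share in shares.items():
    print(f"label {label}: {counts[label]:4d} rows ({share:.1%})")
print("total rows:", total)
```

On the live DataFrame, `df["label"].value_counts(normalize=True)` would give the same shares without hard-coded numbers.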