Text Vectorization
PhD Course, CuCEng, 2025
Course Objective
By the end of this course, students will understand the theoretical foundations behind transforming textual data into numerical representations that can be processed by machine learning models.
Starting from classical Vector Space Models and TF-IDF, the course explores a wide spectrum of embedding techniques up to modern Transformer-based contextual models. Students will gain both theoretical insight and practical implementation skills.
Assessment and Evaluation
- Application Presentation 1: 20% (Week 5)
- Application Presentation 2: 25% (Week 8)
- Application Presentation 3: 25% (Week 11)
- Final Project / Research Paper Presentation: 30% (Week 15)
Weekly Schedule
Week 1: Introduction and Fundamentals
- Topic: Evolution of text representation. Why vectorization? Basic NLP operations (tokenization, normalization, stemming/lemmatization); see the preprocessing sketch below.
- Preparation: Jurafsky & Martin, Speech and Language Processing (Chapters on Regular Expressions, Text Normalization).
- Method: Lecture, Q&A, Discussion.
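A minimal preprocessing sketch for Week 1, assuming NLTK is installed; the example sentence, the regex tokenizer, and the choice of Porter stemming are illustrative assumptions rather than course requirements.

```python
# Toy Week 1 pipeline: tokenization, normalization, stemming, lemmatization.
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexicon used by the lemmatizer

text = "The striped bats were hanging on their feet."

tokens = re.findall(r"[a-zA-Z]+", text)   # simple regex tokenization
normalized = [t.lower() for t in tokens]  # case normalization

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in normalized])                   # crude suffix stripping, e.g. 'hanging' -> 'hang'
print([lemmatizer.lemmatize(t, pos="v") for t in normalized])  # dictionary-based verb lemmas, e.g. 'were' -> 'be'
```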
Week 2: Frequency-Based Models
- Topic: Bag-of-Words, one-hot vectors, the Vector Space Model (VSM), and cosine similarity (see the sketch below).
- Preparation: Manning, Raghavan & Schütze, Introduction to Information Retrieval (Ch. 6).
- Method: Lecture, Mathematical formulation.
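A minimal sketch of the Week 2 material using only NumPy; the two toy documents, the whitespace tokenization, and the raw count weighting are illustrative assumptions.

```python
# Week 2 sketch: bag-of-words vectors in a vector space model, compared with cosine similarity.
import numpy as np

docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted({w for d in docs for w in d.split()})  # shared vocabulary = vector dimensions

def bow(doc):
    # Count how often each vocabulary term occurs in the document.
    words = doc.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

v1, v2 = bow(docs[0]), bow(docs[1])
print(vocab)           # the axes of the vector space
print(cosine(v1, v2))  # high similarity: the documents share most terms
```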
Week 3: Weighting and Dimensionality
- Topic: TF-IDF weighting, the sparsity problem, term specificity, and an introduction to PCA/SVD for dimensionality reduction.
- Preparation: Sparck Jones (1972), A Statistical Interpretation of Term Specificity…
- Method: Lecture, hands-on examples.
- Assignment: Build a simple search engine using TF-IDF and cosine similarity (a starter sketch follows below).
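A possible starting point for the Week 3 assignment, assuming scikit-learn is available; the corpus, the query, and the default TfidfVectorizer settings are placeholders rather than required choices.

```python
# Week 3 sketch: a tiny TF-IDF "search engine" that ranks documents against a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "information retrieval with vector space models",
    "term frequency and inverse document frequency weighting",
    "neural word embeddings for semantic similarity",
]
query = "tf idf weighting"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)         # sparse TF-IDF document-term matrix
query_vec = vectorizer.transform([query])             # project the query into the same space

scores = cosine_similarity(query_vec, doc_matrix)[0]  # one similarity score per document
for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(rank, round(float(scores[idx]), 3), corpus[idx])
```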
Week 4: Distributional Semantics
- Topic: The Distributional Hypothesis (“You shall know a word by the company it keeps”, Firth, 1957). Word2Vec architectures: CBOW and Skip-Gram (see the sketch below).
- Preparation: Mikolov et al. (2013), Efficient Estimation of Word Representations in Vector Space.
- Method: Lecture, architecture illustration.
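A small sketch of the two Week 4 architectures, assuming Gensim is installed; the three-sentence corpus and the hyperparameters are illustrative and far too small to yield meaningful vectors.

```python
# Week 4 sketch: training toy CBOW and Skip-Gram models with Gensim's Word2Vec.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> Skip-Gram

print(cbow.wv["king"].shape)                    # (50,): a dense word vector
print(skipgram.wv.similarity("king", "queen"))  # cosine similarity between two word vectors
```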
Week 5: Application Presentation 1 — Classical Models
- Scope: Classical and statistical representations (BoW, TF-IDF, and VSM).
- Task: Each student presents an implementation project (e.g., document retrieval, similarity analysis, or keyword extraction).
- Evaluation: Methodology (40%), Results (40%), Presentation clarity (20%).
- Method: Student presentations and peer feedback.
Week 6: Supervised Embedding Models — The SemSpace Approach
- Topic: Supervised data in embedding learning; Generalized SemSpace methodology for context- and class-aware representation learning.
- Focus: Using labeled data to construct semantic vector spaces beyond unsupervised Word2Vec/GloVe.
- Preparation: Orhan, U. (2023), Generalized SemSpace: Supervised Contextual Embedding Model.
- Method: Lecture, mathematical formulation, and small demonstration.
Week 7: Sentence Embeddings — ELMo and Contextualized SemSpace
- Topic:
- BiLSTM-based contextual embeddings (ELMo)
- Contextualized SemSpace (sentence-level supervised contextual model)
- Transition from word-level to sentence-level representations (illustrated by the toy sketch below)
- Preparation: Peters et al. (2018), Deep Contextualized Word Representations; Orhan, U. (2024), Contextualized SemSpace: A Hybrid Contextual Embedding Approach.
- Method: Lecture, comparative analysis, architecture visualization.
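The toy PyTorch sketch below is neither ELMo nor Contextualized SemSpace; under simplified assumptions (a made-up vocabulary, random weights, pooling over padding), it only illustrates how a BiLSTM turns static token embeddings into context-dependent vectors and how mean pooling yields a sentence-level vector.

```python
# Week 7 sketch: a toy BiLSTM contextual encoder (NOT ELMo or SemSpace).
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "the": 1, "bank": 2, "loan": 3, "approved": 4, "overflowed": 5}

class ToyContextualEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        static = self.embed(token_ids)       # context-free token embeddings
        contextual, _ = self.bilstm(static)  # context-dependent token vectors
        sentence = contextual.mean(dim=1)    # mean pooling -> sentence embedding (padding included, for simplicity)
        return contextual, sentence

encoder = ToyContextualEncoder(len(vocab))
ids = torch.tensor([[1, 2, 5, 0], [1, 3, 4, 0]])  # "the bank overflowed", "the loan approved" (padded)
token_vecs, sent_vecs = encoder(ids)
print(token_vecs.shape, sent_vecs.shape)  # (2, 4, 64) token vectors, (2, 64) sentence vectors
```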
Week 8: Application Presentation 2 — ELMo-based Contextual Embeddings
- Scope: Hands-on exploration of ELMo embeddings and contextualization performance.
- Task: Train or use pre-trained ELMo embeddings on a small dataset (semantic similarity, NER, or sentiment classification).
- Deliverables: Short demo notebook, metrics comparison, and findings on contextual understanding.
- Method: Student presentations and Q&A.
- Note: No midterm exam (replaced by presentations).
Week 9: Transformer Revolution
- Topic: From sequence models (RNN/LSTM) to attention and self-attention mechanisms; the Transformer encoder-decoder structure (see the attention sketch below).
- Preparation: Vaswani et al. (2017), Attention Is All You Need.
- Method: Lecture, architecture visualization, discussion.
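A compact NumPy sketch of scaled dot-product self-attention in the spirit of Vaswani et al. (2017); the single head, the random projection matrices, and the absence of masking are deliberate simplifications.

```python
# Week 9 sketch: scaled dot-product self-attention for one head, without masking.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # query, key, and value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot products between all token pairs
    weights = softmax(scores, axis=-1)       # attention distribution for each token
    return weights @ V                       # context-mixed token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                               # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))  # random projections (untrained)
print(self_attention(X, Wq, Wk, Wv).shape)                # (5, 8): one updated vector per token
```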
Week 10: Contextual Models — BERT Foundations
- Topic: BERT architecture, pre-training objectives (masked language modeling, MLM, and next sentence prediction, NSP), embedding structure, and the fine-tuning concept.
- Preparation: Devlin et al. (2019), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Method: Lecture and code examples (a short embedding-extraction sketch follows below).
- Preparation for Week 11: Teams prepare small fine-tuning experiments for their BERT-based task.
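A minimal sketch of extracting contextual embeddings from a pre-trained BERT model, assuming the Hugging Face transformers library; the checkpoint name and example sentences are illustrative, and fine-tuning for the Week 11 tasks is left to the teams.

```python
# Week 10 sketch: contextual token and [CLS] embeddings from a pre-trained BERT model.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank approved the loan.", "The river bank overflowed."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

token_embeddings = outputs.last_hidden_state  # (batch, seq_len, 768) contextual vectors
cls_embeddings = token_embeddings[:, 0, :]    # the [CLS] vector, a common sentence-level proxy
print(token_embeddings.shape, cls_embeddings.shape)
```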
Week 11: Application Presentation 3 — BERT-based Embeddings
- Scope: Each team presents a BERT-based application (e.g., text classification, semantic textual similarity (STS), or natural language inference (NLI)) demonstrating contextual embeddings in practice.
- Deliverables: Code notebook, evaluation metrics, error analysis, and discussion.
- Method: Student presentations only (no new lecture content).
Week 12: Positional Encoding in Transformers
- Topic:
- Learned Absolute Positional Encoding (APE): BERT, GPT-2
- Rotary Positional Encoding (RoPE): LLaMA, Qwen, Mistral, DeepSeek (see the RoPE sketch below)
- Relative Positional Encoding: T5 and successors
- How positional encodings interact with token embeddings in Transformer layers.
- Preparation:
- Shaw et al. (2018), Self-Attention with Relative Position Representations
- Su et al. (2021), RoFormer: Enhanced Transformer with Rotary Position Embedding
- Method: Lecture, equation-level explanation, visualization demo.
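A simplified NumPy sketch of the rotary position embedding (RoPE) rotation from Su et al. (2021); the interleaved pairing of dimensions and the toy sizes are assumptions for illustration, not a drop-in implementation of any particular model.

```python
# Week 12 sketch: rotating query/key vectors by position-dependent angles (RoPE).
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (seq_len, dim) with even dim; consecutive (even, odd) dimension pairs are rotated
    # by an angle proportional to the token position.
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # one rotation frequency per pair
    angles = np.outer(positions, inv_freq)            # (seq_len, dim/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin     # 2-D rotation of each pair
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated

q = np.random.default_rng(0).normal(size=(4, 8))      # 4 tokens, 8-dimensional queries
q_rot = rope(q, positions=np.arange(4))
print(q_rot.shape)  # (4, 8); dot products of rotated queries and keys depend on relative offsets
```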
Week 13: Transformer Architectures and Vector Representations
- Topic:
- Encoder-only (e.g., BERT, RoBERTa)
- Decoder-only (e.g., GPT family)
- Encoder-Decoder (e.g., T5, BART)
- Comparative discussion of how each architecture builds and uses embedding vectors (see the mask sketch below).
- Preparation:
- Vaswani et al. (2017), Attention Is All You Need
- Raffel et al. (2020), Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
- Method: Lecture, comparative diagrams, concept mapping.
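A small NumPy sketch contrasting the attention masks behind the three architecture families; it is a conceptual illustration of how each family lets tokens see context, not an implementation of any specific model.

```python
# Week 13 sketch: bidirectional vs. causal attention masks.
# Encoder-only models (BERT-style) let every token attend to every other token;
# decoder-only models (GPT-style) use a causal mask so each token sees only earlier tokens.
import numpy as np

seq_len = 5
encoder_mask = np.ones((seq_len, seq_len), dtype=int)           # full bidirectional attention
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # lower-triangular causal mask

print("encoder-only mask:\n", encoder_mask)
print("decoder-only mask:\n", decoder_mask)
# Encoder-decoder models (T5/BART-style) combine both: a bidirectional encoder plus a
# causal decoder whose cross-attention reads the encoder's output vectors.
```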
Week 14: Evaluating Embedding Quality
- Topic:
- Intrinsic vs Extrinsic evaluation
- Benchmarks: GLUE, SuperGLUE, and MTEB
- Probing techniques and bias measurement in embeddings
- Preparation: Wang et al. (2018), GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding; Muennighoff et al. (2023), MTEB: Massive Text Embedding Benchmark.
- Method: Lecture, discussion, optional mini-lab (a sample intrinsic-evaluation sketch follows below).
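A minimal sketch of an intrinsic, STS-style evaluation, assuming NumPy and SciPy; the embeddings and the human ratings below are made-up placeholders used only to show the Spearman-correlation protocol.

```python
# Week 14 sketch: intrinsic evaluation by correlating cosine similarities with human ratings.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(6, 32))  # embeddings of the first sentence in each pair (placeholder)
emb_b = rng.normal(size=(6, 32))  # embeddings of the second sentence in each pair (placeholder)
gold = np.array([4.8, 3.9, 3.1, 2.2, 1.5, 0.4])  # hypothetical human similarity ratings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

predicted = np.array([cosine(a, b) for a, b in zip(emb_a, emb_b)])
correlation, _ = spearmanr(predicted, gold)  # rank correlation with the gold ratings
print(round(float(correlation), 3))          # higher correlation -> better intrinsic quality
```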
Week 15: Final Project Presentations
- Scope: Student final presentations only (no new lecture).
- Task: Each student either (a) presents their own embedding-based project, or (b) reviews and critiques a recent research paper on modern embedding models.
- Evaluation: Originality (40%), Technical depth (30%), Presentation clarity (30%).
- Method: Student presentations and feedback.
