What is a Tokenizer?

A tokenizer is a core component in natural language processing (NLP) and in programming-language analysis. It splits input text into smaller units, typically words, subwords, or symbols, so that later stages can process them.

Tokenization is usually the first step in text processing and lays the foundation for many algorithms and models, particularly in machine learning and deep learning. Different languages and applications call for different kinds of tokenizers. For example, a whitespace-based tokenizer works well for English, where spaces separate words, whereas a character-based tokenizer is often more effective for Chinese, which is written without spaces between words.
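The contrast between the two strategies can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names `whitespace_tokenize` and `char_tokenize` and both sample sentences are assumptions made for this example.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace."""
    return text.split()


def char_tokenize(text):
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


# Sample inputs (illustrative only).
english = "Tokenizers split text into units"
chinese = "我喜欢自然语言处理"

english_tokens = whitespace_tokenize(english)  # one token per word
chinese_ws = whitespace_tokenize(chinese)      # no spaces, so one giant token
chinese_chars = char_tokenize(chinese)         # one token per character

print(english_tokens)
print(chinese_ws)
print(chinese_chars)
```

On the Chinese sentence, whitespace splitting returns the whole sentence as a single token, which is why a character-level (or dedicated word-segmentation) approach tends to be a better starting point there.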

The importance of tokenization lies in its ability to give text data a structure that can be analyzed and processed. Once text is split into tokens, algorithms can more easily recognize patterns, extract features, and make predictions. Choosing a suitable tokenizer is therefore important for model performance.
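As one hedged illustration of how tokens become features, the sketch below counts tokens and turns the counts into a fixed-order vector (a simple bag-of-words representation). The helper name `bag_of_words`, the sample sentence, and the choice of alphabetical vocabulary order are all assumptions made for this example.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


# Illustrative input, tokenized on whitespace.
tokens = "the cat sat on the mat".split()
counts = bag_of_words(tokens)

# Assign each distinct token an integer ID (alphabetical order here).
vocab = {tok: idx for idx, tok in enumerate(sorted(counts))}

# Build a count vector whose positions follow the vocabulary IDs.
vector = [counts[tok] for tok in sorted(counts)]

print(vocab)
print(vector)
```

A different tokenizer would produce a different vocabulary and therefore different features, which is one concrete way the tokenizer choice affects a downstream model.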

As artificial intelligence and machine learning continue to develop, tokenization methods are evolving too. Many modern models use subword tokenization techniques such as Byte Pair Encoding (BPE) or WordPiece. These handle rare words and new terms effectively, because an unseen word can still be represented as a sequence of known subword units, and this improves the model's ability to generalize.
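To make the idea behind BPE more concrete, here is a simplified sketch of its training loop: start from single characters and repeatedly merge the most frequent adjacent pair of symbols. It is a toy version under stated assumptions, not the implementation used by any particular library. The function names, the tiny word-frequency table, and the number of merges are hypothetical, and details such as end-of-word markers and byte-level handling are left out.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    # Start with every word split into individual characters.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first on ties)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Toy corpus statistics (illustrative only).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(word_freqs, 3)

print(merges)
print(segmented)
```

With this toy table, the frequent suffix "est" ends up as a single symbol after two merges, so a word such as "newest" is segmented into smaller pieces plus "est". This is the mechanism that lets subword tokenizers represent rare or new words from known units.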