# 07. AI on-device strategy

> **Audience**: OmiScan tech lead, mobile and backend engineers, and whoever is preparing the week-2 PoC.
> **Purpose**: Lock the technical choices for on-device VLM/OCR, multi-shot capture, the quality gate, and prompt engineering before the development sprint starts.

---

## 7.1. Summary (TL;DR)

OmiScan adopts a **three-tier hybrid architecture**:

1. **Quality Gate** running in real time on the camera preview (image assessed in ≤ 5 s, red/yellow/green): Laplacian variance + ML Kit
2. **Hybrid OCR + VLM on-device**: **Gemma 4 E2B** via the `flutter_gemma` plugin as the primary VLM, plus **ML Kit Text Recognition v2** as OCR ground truth for cross-checking against hallucination
3. **Plan B server fallback**, activated only on Tier B thermal throttling, ≥ 3 hallucinated fields, or more than 6 images per product

**Why E2B (not E4B)**:
- Tier B devices have only 4–6 GB RAM; E4B peaks at ~3.4 GB of GPU memory on a flagship, which is NOT safe for mid-range devices
- E2B peaks at ~1.45–1.73 GB with a GPU TTFT of ~0.3 s (flagship figures), leaving headroom within the 8 s end-to-end budget
- E4B scores only ~38% better than E2B on OmniDocBench, which does not justify roughly double the RAM

**Multi-shot strategy**: split capture into 4 specialised images (front, back, nutrition, extras), run per-image inference, and merge the JSON on the Dart side. Do not send 4 images in one prompt, because that exceeds the prefill budget on Tier B.

**Mandatory before go-live**: a real benchmark on the 5 target devices plus a 200-label Vietnamese pilot measuring per-field accuracy and hallucination rate.

---

## 7.2. Current state of Gemma 4 (released 2 April 2026)

### 7.2.1. Model family and characteristics

Gemma 4 was released on **2 April 2026** in 4 sizes: **E2B (2.3B effective / 5.1B with embeddings), E4B (4.5B / 8B), 26B A4B MoE, and 31B Dense**. The two edge variants (E2B/E4B) are designed for mobile/edge with a **128K context window** and native multimodality (text + image + audio + video).

Three points matter for OmiScan:

- **Native multi-image**: "Freely mix text and images in any order within a single prompt"; repeating the `<|image|>` token is supported.
- **Configurable visual token budget**: 70 / 140 / 280 / 560 / **1,120 tokens per image**. A higher budget preserves text detail (good for OCR/document); a lower one is faster (good for classification). Google recommends **560–1,120 for OCR/document**.
- **Native JSON / function calling**: structured output is available without special instructions.

### 7.2.2. Gemma 4 vision benchmarks

| Benchmark | E2B | E4B | Relevance to OmiScan |
|---|---|---|---|
| OmniDocBench 1.5 (edit distance, lower=better) | 0.290 | **0.181** | E4B reads documents ~38% better |
| MATH-Vision | 52.4% | 59.5% | Understanding nutrition-table structure |
| MMMU Pro (vision) | 44.2% | 52.6% | General visual reasoning |

E4B is 30–40% ahead on document tasks, at the cost of roughly double the RAM.

### 7.2.3. Measured latency benchmarks (LiteRT-LM, prefill 1024 + decode 256)

> ⚠️ **Important**: The table below contains only the official figures published by litert-community on Hugging Face, measured **only on 2026 flagships (iPhone 17 Pro, Galaxy S26 Ultra), NOT on the Tier B targets** (iPhone XS/11/12 Pro, Pixel 8, Galaxy A53/S24). **The Tier B estimates in section 7.2.4 are extrapolated from flagship figures and community reports and have not been benchmarked directly**. The "Tier B ≤ 8s" commitment in Section 02.6 is therefore **an assumption to verify**, not a firm commitment: the week-2 PoC will run on the 5 target devices and the GO/NO-GO decision will follow the measured results.

| Model | Device | Backend | Prefill | Decode | TTFT | RAM peak | Model size |
|---|---|---|---|---|---|---|---|
| **E2B** | iPhone 17 Pro *(flagship)* | GPU | 2,878 t/s | 56.5 t/s | **0.3s** | 1,450 MB | 2,583 MB |
| **E2B** | iPhone 17 Pro *(flagship)* | CPU | 532 t/s | 25.0 t/s | 1.9s | 607 MB | 2,583 MB |
| **E2B** | Galaxy S26 Ultra *(flagship)* | GPU | 3,808 t/s | 52.1 t/s | **0.3s** | 676 MB | 2,583 MB |
| **E2B** | Galaxy S26 Ultra *(flagship)* | CPU | 557 t/s | 46.9 t/s | 1.8s | 1,733 MB | 2,583 MB |
| **E4B** | iPhone 17 Pro *(flagship)* | GPU | 1,189 t/s | 25.1 t/s | 0.9s | **3,380 MB** | 3,654 MB |
| **E4B** | iPhone 17 Pro *(flagship)* | CPU | 159 t/s | 9.7 t/s | 6.5s | 961 MB | 3,654 MB |
| **E4B** | Galaxy S26 Ultra *(flagship)* | GPU | 1,293 t/s | 22.1 t/s | 0.8s | 710 MB | 3,654 MB |
| **E4B** | Galaxy S26 Ultra *(flagship)* | CPU | 195 t/s | 17.7 t/s | 5.3s | 3,283 MB | 3,654 MB |
| **E2B** | iPhone XS / 11 / 12 Pro *(Tier B)* | n/a | **NOT YET BENCHMARKED** | n/a | n/a | n/a | n/a |
| **E2B** | Pixel 8 / Galaxy A53 *(Tier B)* | n/a | **NOT YET BENCHMARKED** | n/a | n/a | n/a | n/a |

**Important notes for Tier B (OmiScan's real target)**:
- Google officially recommends **E4B for flagships** and **E2B for mid-range**, but Google's 2026 "mid-range" is roughly a 2024 flagship (Pixel 8, 8 GB RAM). OmiScan's Tier B also includes older devices (iPhone XS, 4 GB RAM).
- A confirmed issue exists: "Gemma 4 E2B on Pixel 8: 0.10.0 fails on GPU decode, 0.10.1 runtime patch fixes it" (LiteRT-LM #1850). **Pixel 8 in Tier B needs careful version pinning** or a CPU fallback.
- **The week-2 PoC MUST** measure TTFT, decode rate, RAM peak, and thermal throttling on: iPhone 12 Pro, iPhone 15 Pro, Pixel 8, Galaxy S24, Galaxy A53. If Tier B fails, fall back to Plan B (server LLM), which is prepared in section 7.8.
- The iPhone XS and other 4 GB RAM devices will very likely NOT run E2B, given the ~1.7 GB RAM peak combined with the iOS Jetsam memory limit. The actual device fleet of Viện DD must be surveyed in the week-1 Workshop (Q-WS-08).

### 7.2.4. Estimated end-to-end latency for OmiScan

For one 768×768 image, a ~200-token prompt, and ~500 tokens of JSON output (token budget 560 per image):

| Phone tier | Backend | Prefill | Decode | Total inference | Verdict |
|---|---|---|---|---|---|
| Tier A (iPhone 15+/S24+) GPU E2B output 500 tokens | GPU | ~0.3–0.5s | ~9–10s | **~10s/image** | ❌ Over budget |
| Tier A GPU E2B **output 250 tokens** | GPU | ~0.3–0.5s | ~4.5s | **~5s/image** | ✅ OK |
| Tier B (Pixel 8/A53) CPU E2B output 500 tokens | CPU | ~1.5–2s | ~10–12s | **~12–14s/image** | ❌ Over budget |
| Tier B CPU E2B + speculative decode + output 250 tokens | CPU | ~1.5–2s | ~5–6s | **~7–8s/image** | ✅ Barely fits |

**Key insight**: 500 output tokens is **too many** for the 8 s budget. Output must be **capped at ≤ 250 tokens** by:
1. Using a compact JSON schema (abbreviated keys: `e: 245` instead of `energy_kcal: 245`)
2. Emitting only fields that have data
3. Using the **Multi-Token Prediction drafters** (speculative decoding) that Google announced for Gemma 4, with a claimed speed-up of ~3x. On the S26 Ultra with speculative decoding, the "summarize text" task reaches 46.0 t/s, about 2.1x the 21.9 t/s baseline
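
The budget arithmetic behind the table can be sketched as prefill time plus decode time. This is a minimal illustration, not a measurement tool: the function names are hypothetical, and the rates passed in are whatever figures are available (flagship numbers from section 7.2.3 today, real Tier B numbers after the PoC).

```dart
/// Rough per-image latency in seconds: prefill time plus decode time.
/// Rates are in tokens per second. Inputs are estimates, not measurements.
double estimateLatencySeconds({
  required int prefillTokens,
  required double prefillRate,
  required int outputTokens,
  required double decodeRate,
}) {
  return prefillTokens / prefillRate + outputTokens / decodeRate;
}

/// True when the estimate fits the end-to-end budget (8 s in this document).
bool fitsBudget(double latencySeconds, {double budgetSeconds = 8.0}) =>
    latencySeconds <= budgetSeconds;
```

With the iPhone 17 Pro GPU figures (2,878 t/s prefill, 56.5 t/s decode) and a prefill of 760 tokens (560 image + ~200 prompt), 250 output tokens come to about 4.7 s, while 500 output tokens come to about 9.1 s and exceed the budget. The estimate ignores model load time, image preprocessing, and thermal effects.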

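### 7.2.5. RAM, battery, thermal

- **E2B RAM peak**: 1.45–1.73 GB (flagship figures). This fits a 4 GB device on paper, but section 7.2.3 notes that 4 GB devices such as the iPhone XS may still fail under the iOS Jetsam memory limit, so treat it as unverified until the PoC.
- **Battery / thermal**: Community reports mention "5–15 tokens/sec on E2B" on the iPhone 15 Pro and thermal throttling after 30 s of continuous use on the A16. Google states that Gemma 4 is "up to 4× faster, 60% less battery" compared with Gemma 3.
- **Real OmiScan volume**: 30 labels × 4 images = ~120 inferences per day per device, **spread across the whole day**, so the load is not sustained and thermal throttling is **not a major concern**.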

### 7.2.6. Known limitations of Gemma 4

| Limitation | Mitigation |
|---|---|
| Repetition loops with JSON constrained generation (vLLM #40080) | Drop the grammar constraint; rely on the prompt and post-validate on the Dart side with `dart:convert`; cap retries at 2 |
| `think=false` breaks `format` in Ollama (#15260) | Does not affect the MediaPipe/LiteRT-LM stack, so it is not relevant to OmiScan |
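
The first mitigation (prompt only, then post-validate with `dart:convert` and retry at most twice) could look roughly like the sketch below. `extractJsonObject` and `generateValidated` are hypothetical helper names, and `generate` stands in for whatever wrapper the app puts around the on-device inference call; none of this is part of `flutter_gemma`.

```dart
import 'dart:convert';

/// Tries to pull one JSON object out of raw model output.
/// Returns null when no well-formed object is found.
Map<String, dynamic>? extractJsonObject(String raw) {
  final start = raw.indexOf('{');
  final end = raw.lastIndexOf('}');
  if (start < 0 || end <= start) return null;
  try {
    final decoded = jsonDecode(raw.substring(start, end + 1));
    return decoded is Map<String, dynamic> ? decoded : null;
  } on FormatException {
    return null; // truncated or looping output
  }
}

/// Calls [generate] up to 1 + [maxRetries] times until the output parses.
/// [generate] is a hypothetical wrapper around the inference call.
Future<Map<String, dynamic>?> generateValidated(
  Future<String> Function(int attempt) generate, {
  int maxRetries = 2,
}) async {
  for (var attempt = 0; attempt <= maxRetries; attempt++) {
    final parsed = extractJsonObject(await generate(attempt));
    if (parsed != null) return parsed;
  }
  return null; // caller flags the image for review or server fallback
}
```

Taking the span from the first `{` to the last `}` tolerates chatter or code fences around the object, but it only checks well-formedness. Schema checks (required keys, value types) would still need to run on the returned map.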

**Sources**: [Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4), [HuggingFace blog](https://huggingface.co/blog/gemma4), [E2B litert-lm](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm), [E4B litert-lm](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm), [Multi-token prediction](https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/), [LiteRT-LM #1850](https://github.com/google-ai-edge/LiteRT-LM/issues/1850).

---

## 7.3. The flutter_gemma plugin: production-readiness assessment

### 7.3.1. Plugin status (updated May 2026)

| Criterion | Value |
|---|---|
| pub.dev version | **0.14.5** (~5 May 2026) |
| Native release tag | `native-v0.11.0-a` (7 May 2026) |
| Likes / Downloads | 330+ likes, 12.1k downloads |
| Open GitHub issues | 25 |
| License | MIT |
| Publisher | mobilepeople.dev (verified) |
| Supported models | Gemma 4 E2B/E4B, Gemma 3n, FastVLM, Qwen 3, Phi-4 Mini, DeepSeek R1, SmolLM, EmbeddingGemma |

**Verdict**: The plugin is actively maintained but still at 0.x, so breaking changes are a risk. Version 0.14.0 already brought a major rewrite (desktop moved from JVM/gRPC to `dart:ffi`; iOS gained Metal GPU support for `.litertlm`). **Single point of failure**: one main maintainer (Sasha Denisov), and the plugin is not an official Google package.

### 7.3.2. API surface for image input

```dart
// Install the model from the network: suits OmiScan because a model over 100 MB cannot be bundled into the IPA
await FlutterGemma.installModel(modelType: ModelType.gemma4)
  .fromNetwork('https://cdn.omiscan.vn/gemma4-e2b-q4.task').install();

// Enable image support, up to 4 images per message (for multi-shot)
final model = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  supportImage: true,
  maxNumImages: 4,
);

final chat = await model.createChat();
final msg = Message.withImage(
  text: '...nutrition label extraction prompt...',
  imageBytes: imageBytes,
  isUser: true,
);
```

The plugin handles the JPEG/PNG → image embeddings conversion itself.

**To verify in the PoC**: `Message.withImage` currently appears to accept a single image; multi-image requires declaring `maxNumImages: N`. The documentation says this is allowed, but no clear multi-image example was found, so the **week-2 PoC must verify it**.

### 7.3.3. Model formats: `.task` vs `.litertlm`

| Format | Used for | Engine | Note |
|---|---|---|---|
| **`.task`** | iOS + Android (mobile/web) | MediaPipe LLM Inference | Standard format; many models available from Kaggle/Google |
| **`.litertlm`** | Desktop + new iOS Metal GPU path | LiteRT-LM | Newer format, better performance, default for Gemma 4 multimodal |

**OmiScan only needs `.task`** for both iOS and Android, which simplifies the pipeline. Because the table lists `.litertlm` as the default for Gemma 4 multimodal, the PoC should confirm that a `.task` build of E2B with image support is available.

### 7.3.4. Bundling/downloading the model

- **Do NOT bundle** the model into the IPA/APK (E2B Q4 ≈ 2.5 GB, which exceeds the App Store limit and makes installation heavy)
- Use `NetworkSource` with the OmiGroup CDN (S3 + CloudFront)
- Download on first app launch with a progress bar; store the file in the app sandbox; verify its SHA256
- The same pattern is used by MS Lens, Adobe Scan, and Google Lens

### 7.3.5. Stability concerns and alternatives

| Alternative | Limitation | Recommendation |
|---|---|---|
| **gemini_nano_android** (Piero16301) | Android only, Pixel 9+/Honor flagships → NOT cross-platform | Not suitable |
| **foundation_models_framework** (iOS 26+ Apple) | **Currently text input only on-device**, so not a VLM | Not suitable for v1 |
| **llm_toolkit** | Cross-platform but less mature than flutter_gemma | Third choice |
| **Native MethodChannel + MediaPipe SDK directly** | Requires separate Swift + Kotlin code | **Native fallback if flutter_gemma is not stable** |

**Decision**: Use `flutter_gemma` while preparing the native MethodChannel code path as a plugin-level fallback in case the plugin hits a blocker. This is separate from the server fallback called Plan B in section 7.8.

**Sources**: [pub.dev flutter_gemma](https://pub.dev/packages/flutter_gemma), [GitHub DenisovAV/flutter_gemma](https://github.com/DenisovAV/flutter_gemma), [foundation_models_framework](https://pub.dev/packages/foundation_models_framework), [Apple Foundation Models 2025](https://machinelearning.apple.com/research/apple-foundation-models-2025-updates).

---

## 7.4. Comparison of VLM/OCR stacks for mobile

| Stack | Latency/image | RAM peak | Vietnamese | Structured | Flutter maturity | License | OmiScan recommendation |
|---|---|---|---|---|---|---|---|
| **Gemma 4 E2B** (flutter_gemma) | 1–3s GPU / 5–8s CPU | 1.5–1.7 GB | Strong (140 languages natively, MMMLU 67.4%) | Native JSON | Beta-stable, 1 maintainer | Apache 2.0 | **PRIMARY** |
| Gemma 4 E4B | 3–6s GPU / 10s+ CPU | **3.4 GB** | Stronger (MMMLU 76.6%) | Native JSON | Same | Apache 2.0 | Optional for Tier A |
| Apple Vision `VNRecognizeTextRequest` | ~50–200ms | <100 MB | vi-VN supported on iOS 16+ | Pure OCR + bbox | Native via MethodChannel | Apple SDK | OCR fallback on iOS |
| Apple Foundation Models (iOS 26+) | Text only | ~3B params | Vietnamese on iOS 26.1+ | Yes | foundation_models_framework | Apple SDK | NO (text only) |
| **VNDocumentCameraViewController** | Real-time | <50 MB | N/A (scan + crop only) | Returns a deskewed image | Simple MethodChannel | Apple SDK | **USE for auto-crop on iOS** |
| **ML Kit Text Recognition v2** | ~100–300ms | ~50–100 MB | **Officially supports "vi" (Latin script)** | Text + bbox + line/block | google_mlkit_text_recognition (very mature) | Free, on-device | **USE as OCR ground truth** |
| ML Kit Document Scanner API | Real-time | ~100 MB | Auto-crop/deskew, no OCR | Returns a cleaned image | Beta, Android only | Free | USE for auto-crop on Android |
| PaddleOCR PP-OCRv5 mobile | ~500ms–2s/image CPU | ~70M params, ~280MB | Supports Vietnamese, +13% accuracy vs v4 | Text + bbox | No official Flutter plugin | Apache 2.0 | Backup OCR option |
| Tesseract 5 + vie.traineddata | ~2–5s/image on mobile | ~50–200 MB | >97% on 200–400 DPI scans | Text + bbox | tesseract_ocr Flutter (poorly maintained) | Apache 2.0 | NO (slower than ML Kit) |
| MLC LLM (Llama 3.2 Vision / Phi-3.5 Vision) | Comparable to E4B | 3–6 GB | Weaker than Gemma 4 for Vietnamese | Yes | mlc_llm Flutter, rarely updated | Apache 2.0 | NO |
| Qwen 2.5 VL 3B/7B | Slow on mobile | 3B AWQ → 1.35B params | Average | Yes | No mature Flutter plugin | Apache 2.0 | NO (not yet optimised for mobile) |
| Gemini Nano via AICore | Sub-100ms text, image multimodal | NPU offload | Yes (Gemma 4 backbone) | Yes | gemini_nano_android | Google SDK | Pixel 9+ only → not cross-platform |
| FastVLM (Apple research) | ~250ms multimodal on iPhone | 0.5B/1.5B/7B | Average | Yes (via flutter_gemma) | Via flutter_gemma | Apple non-commercial | Optional PoC on iOS |

**Sources**: [ML Kit Text Recognition v2 languages](https://developers.google.com/ml-kit/vision/text-recognition/v2/languages), [ML Kit Document Scanner](https://developers.google.com/ml-kit/vision/doc-scanner), [PP-OCRv5 multilingual](https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.en.md), [VNRecognizeTextRequest](https://developer.apple.com/documentation/vision/vnrecognizetextrequest/recognitionlanguages), [VNDocumentCameraViewController](https://developer.apple.com/documentation/visionkit/vndocumentcameraviewcontroller), [FastVLM Apple](https://machinelearning.apple.com/research/fast-vision-language-models).

---

## 7.5. Multi-shot strategy

### 7.5.1. Common industry pattern

Microsoft Document Intelligence, AWS Textract, and Google Document AI all follow a **modular pipeline architecture**: Capture → Classify → Extract per region. The AWS Textract OCR + Bedrock LLM pattern serves as the reference baseline: OCR provides the **authoritative text**, and the LLM handles **classification + structuring**.

### 7.5.2. Standard 4-image structure for OmiScan

OmiScan asks field staff to capture 4 specialised images per product:

| Image | Expected content | Fields to extract | Quality criteria |
|---|---|---|---|
| **Front** | Front of the packaging | `name`, `brand`, `weight`, product image | Product name is clear, brand is readable |
| **Back** | Back of the packaging | `ingredients`, `manufacturer`, `address`, `nsx_hsd`, `selfDeclarationNumber` | All text on the back is readable |
| **Nutrition** | Close-up of the nutrition table | The 5 core indicators under Circular 29/2023 (TT 29/2023), plus total sugars if sugar is added and saturated fat if the food is fried | Table is legible, tilt no more than 10° |
| **Extras** (optional) | Barcode + additional information (warnings, declaration mark, QR code) | `barcode`, `warningLabels`, `qrCode` | Needed only if the main information is incomplete |

### 7.5.3. Two orchestration options

**Option A: multi-image in one prompt (recommended for Tier A)**:
- A single call with 4 images at a budget of 280–560 tokens per image = 1,120–2,240 image tokens
- Prompt: "These are 4 images of one product (front, back, nutrition, extras). Merge the information and output JSON following schema X."
- **Pros**: The model has cross-image context (it reads the brand from the front and the ingredients from the back in the same turn)
- **Cons**: Total prefill is ~2,500–3,500 tokens; on a Tier B Pixel 8 with CPU prefill at 200 t/s that means a TTFT of **12–17s**, which is over budget

**Option B: per-image extraction + Dart-side merge (recommended for Tier B, default for v1)**:
- 4 sequential inferences, each with 1 image and a specialised prompt (for example, the nutrition-table image gets a schema with nutrient fields only)
- Separate JSON output per image, merged by Dart logic plus the ML Kit OCR cross-check
- **Pros**: Each inference needs only ~700 prefill tokens → 1–2s on GPU; per-image prompts are flexible; a failed image is easy to retry
- **Cons**: Cross-image context is lost. Mitigation: pass the JSON output of the previous image into the next image's prompt as "memory" (system instruction)

**Decision**: **Option B is the default** for v1 (it ensures Tier B can run). Option A is an optional toggle for Tier A users who want higher accuracy.
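
The Dart-side merge of per-image results could be sketched as follows. This is an illustration under stated assumptions: `MergeResult` and `mergeShots` are hypothetical names, fields are treated as flat scalar values, and the "first non-null value wins" policy is one possible choice rather than a decided rule.

```dart
/// Result of merging per-image extractions for one product.
class MergeResult {
  final Map<String, dynamic> fields = {};
  final Map<String, List<dynamic>> conflicts = {};
}

/// Merges per-image JSON maps in capture order (front, back, nutrition, extras).
/// The first non-null value wins; differing later values are recorded as
/// conflicts so the UI or the OCR cross-check can resolve them.
MergeResult mergeShots(List<Map<String, dynamic>> shots) {
  final result = MergeResult();
  for (final shot in shots) {
    shot.forEach((key, value) {
      if (value == null) return;
      final existing = result.fields[key];
      if (existing == null) {
        result.fields[key] = value;
      } else if (existing != value) {
        result.conflicts.putIfAbsent(key, () => [existing]).add(value);
      }
    });
  }
  return result;
}
```

Nested objects (such as a nutrition map) compare by identity in Dart, so a real implementation would flatten them or compare them field by field before merging.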

### 7.5.4. Adaptive multi-shot loop: auto-detecting "enough information"

After each image is processed, the Dart side checks schema completeness:

```dart
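// Field names are illustrative; the compact schema in 7.7.2 uses `name_vi` and `kcal`.
const requiredFields = [
  'product_name', 'energy_kcal', 'protein_g',
  'fat_g', 'carb_g', 'sodium_mg',
];

// `result` is the merged extraction for the current product.
// showSubmitConfirmation and suggestNextShot are app-specific UI hooks.
void checkCompleteness(Map<String, dynamic> result) {
  final missing = requiredFields.where((f) => result[f] == null).toList();
  final completeness = 1 - missing.length / requiredFields.length;

  // With 6 required fields, the 0.8 threshold tolerates one missing field.
  if (completeness >= 0.8) {
    showSubmitConfirmation();
  } else {
    suggestNextShot(missing); // e.g. "Please retake the nutrition table"
  }
}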
```

Roboflow and Mindee use the same pattern for their nutrition label OCR APIs.

**Sources**: [AWS Textract + Bedrock IDP](https://aws.amazon.com/blogs/machine-learning/intelligent-document-processing-with-amazon-textract-amazon-bedrock-and-langchain/), [Roboflow nutrition label](https://blog.roboflow.com/read-food-labels-computer-vision/), [Mindee Nutrition Facts OCR API](https://www.mindee.com/blog/nutrition-facts-label-ocr-api-streamlining-food-label-compliance-and-data-management).

---

## 7.6. Quality Gate (real-time, < 5s)

This section details how the app **evaluates an image in under 5 s and requests a reshoot with a clear reason**.

### 7.6.1. Five image-quality dimensions

| # | Dimension | Technique | Threshold | Latency | Reason shown to the user |
|---|---|---|---|---|---|
| 1 | **Blur** | Variance of Laplacian on a grayscale frame downsampled to 240×320 | variance < 100 → blur | <50ms | "Image is blurry. Hold the phone steady and retake" |
| 2 | **Exposure (too dark/bright)** | Mean grayscale luminance | <60 → dark, >200 → bright (0–255) | <10ms | "Image is too dark/bright. Adjust the lighting" |
| 3 | **Glare** | % of saturated pixels (lum > 250) | >5% → glare | <10ms | "Reflected light detected. Change the shooting angle" |
| 4 | **Skew** | VNDetectDocumentSegmentationRequest (iOS) / ML Kit Doc Scanner detect (Android) | tilt > 15° → skew | ~100ms | "Image is tilted. Hold the phone straight" |
| 5 | **Coverage (occluded/cropped text)** | Quick ML Kit Text Recognition v2 pass: count recognised blocks vs expected | < N blocks → reshoot | ~200ms | "Nutrition table is cut off. Capture the whole table" |

**Total latency per frame**: ~300–400ms, so the gate can run at **2–3 fps** on the camera preview, which is enough for a real-time feedback overlay.

### 7.6.2. Implementation for Flutter

- **OpenCV via FFI** for the Laplacian + reduce steps: use the `opencv_dart` package or write a custom wrapper. A Medium article confirms feasibility in Dart.
- **ML Kit recommends nv21 (Android) / bgra8888 (iOS) at 720p or lower**, which is sufficient for ML tasks, so align the downsampling strategy with that.
- **Apple Vision offers VNDetectFaceCaptureQualityRequest** for faces (it considers exposure/pose/blur) but has NO equivalent for documents, so this **must be implemented in-house**.
- **A custom TFLite blur model** is unnecessary: Laplacian variance is good enough and roughly 10× cheaper.
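
To make the first three checks concrete, here is a pure-Dart sketch that works on a grayscale byte buffer (one byte per pixel, row-major). It is a stand-in for illustration only: the document proposes OpenCV via FFI for production, the function names are hypothetical, and the thresholds are the ones from the table in 7.6.1.

```dart
import 'dart:typed_data';

/// Variance of the 4-neighbour Laplacian; border pixels are skipped.
double laplacianVariance(Uint8List gray, int width, int height) {
  var sum = 0.0;
  var sumSq = 0.0;
  var count = 0;
  for (var y = 1; y < height - 1; y++) {
    for (var x = 1; x < width - 1; x++) {
      final i = y * width + x;
      final lap = gray[i - width] + gray[i + width] +
          gray[i - 1] + gray[i + 1] - 4 * gray[i];
      sum += lap;
      sumSq += lap * lap;
      count++;
    }
  }
  if (count == 0) return 0;
  final mean = sum / count;
  return sumSq / count - mean * mean;
}

/// Mean luminance on the 0-255 scale.
double meanLuminance(Uint8List gray) {
  if (gray.isEmpty) return 0;
  var total = 0;
  for (final v in gray) {
    total += v;
  }
  return total / gray.length;
}

/// Fraction of pixels brighter than [level].
double saturatedFraction(Uint8List gray, {int level = 250}) {
  if (gray.isEmpty) return 0;
  return gray.where((v) => v > level).length / gray.length;
}

/// Issue codes for one frame, using the thresholds from section 7.6.1.
List<String> qualityIssues(Uint8List gray, int width, int height) {
  final issues = <String>[];
  if (laplacianVariance(gray, width, height) < 100) issues.add('blur');
  final mean = meanLuminance(gray);
  if (mean < 60) issues.add('too_dark');
  if (mean > 200) issues.add('too_bright');
  if (saturatedFraction(gray) > 0.05) issues.add('glare');
  return issues;
}
```

A flat grey frame has zero Laplacian variance and is reported as blurry, while a high-contrast pattern passes. The variance threshold of 100 depends on resolution and on the kernel used, so it should be recalibrated on real label photos during the PoC.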

### 7.6.3. UX feedback overlay

Display **4 indicators** (Blur / Light / Skew / Coverage) at the top of the camera preview, each with 3 states: 🔴 red (clear failure), 🟡 yellow (acceptable), 🟢 green (good). Light groups the exposure and glare checks, which is how the 5 dimensions map onto 4 indicators.

```
┌──────────────────────────────┐
│ 🔴 Blur  🟡 Light  🟢 Skew  🟢 Coverage │   ← Status row
│                              │
│      [camera preview]        │
│                              │
│   ┌──────────────────┐       │
│   │CAPTURE (disabled)│       │   ← Disabled when ≥ 1 indicator is red
│   └──────────────────┘       │
│                              │
│  "Blurry image. Hold steady" │   ← Reason string from the failing dimension
└──────────────────────────────┘
```

Enable the Capture button only when **all 4 indicators are at least yellow**. When the user taps Capture, save the frame together with its quality metadata.
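
The gating rule is small enough to state in code. The names below (`GateStatus`, `canCapture`, `firstBlockingReason`) are hypothetical and only illustrate the rule.

```dart
/// Traffic-light state of one quality indicator.
enum GateStatus { red, yellow, green }

/// Capture is allowed only when no indicator is red,
/// that is, every indicator is at least yellow.
bool canCapture(Map<String, GateStatus> indicators) =>
    indicators.values.every((s) => s != GateStatus.red);

/// Reason string of the first red indicator, or null when capture is allowed.
String? firstBlockingReason(
  Map<String, GateStatus> indicators,
  Map<String, String> reasons,
) {
  for (final entry in indicators.entries) {
    if (entry.value == GateStatus.red) return reasons[entry.key] ?? entry.key;
  }
  return null;
}
```

Dart map literals keep insertion order, so the indicator listed first takes priority when several are red at once.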

UX references: MS Lens, Adobe Scan, and Google Drive Scan all use a similar overlay pattern.

### 7.6.4. Requesting additional images

Once the user has captured at least 1 image, the app runs the adaptive multi-shot loop (section 7.5.4) to decide whether more images are needed. It displays:

```
✓ Front: captured (name, brand read)
✓ Back: captured (ingredients, manufacture/expiry dates read)
✗ Nutrition: needs capture (energy_kcal, protein_g missing)
○ Extras: optional (barcode already read from Front)
```

The user can tap "Submit now" to skip, but the app warns: "Nutrition table data is still missing. The submission may be flagged for review."

**Sources**: [Blur detection OpenCV Dart Flutter](https://medium.com/@myavuzokumus/blur-detection-on-an-image-using-dart-with-opencv-1cc98002827c), [PyImageSearch Laplacian](https://pyimagesearch.com/2015/09/07/blur-detection-with-opencv/), [ML Kit Flutter image analyzer](https://cdmunoz.medium.com/detect-before-processing-an-intelligent-image-analyzer-with-flutter-and-ml-kit-730f1a08b67b), [Apple capture quality](https://developer.apple.com/documentation/vision/selecting-a-selfie-based-on-capture-quality).

---

## 7.7. Prompt engineering against hallucination

### 7.7.1. Best practices for Gemma 4 multimodal

According to Google's official prompt formatting guide:

- **Put the image before the text**: `<|image|> This is a nutrition table...`
- **Token budget ≥ 560 for OCR** (1,120 for very small text on labels)
- **System instruction** using `<|turn>system ...<turn|>` to define the role and schema
- **Few-shot examples**: confirmed to improve structured output significantly for food labels (in PMC12387780, GPT-4V/4o with few-shot prompting reached a Cohen's Kappa of 0.91 against annotators)

### 7.7.2. Proposed schema (compact)

> **Unified terminology** (per Circular 29/2023, TT 29/2023): the `nutrition_per_100g` schema has 7 fields = **5 mandatory core fields** (kcal, protein_g, carb_g, fat_g, sodium_mg; Article 5.1) + **2 conditional fields** (sugar_g only for products with added sugar, Article 5.2; fat_sat_g only for fried foods, Article 5.3). The VLM extracts all 7 when they are visible in the image; the rule engine validates them according to scope.

```json
{
  "name_vi": "Bánh mì sandwich Staff",
  "brand": "Staff",
  "category": "bakery",
  "weight_g": 70,
  "nutrition_per_100g": {
    "kcal": 245,
    "protein_g": 6.5,
    "fat_g": 12.3,
    "fat_sat_g": 2.1,
    "carb_g": 28.0,
    "sugar_g": 5.0,
    "sodium_mg": 320
  },
  "ingredients_text": "Bột mì, đường, dầu thực vật, men, muối...",
  "manufacturer": "...",
  "nsx": "2026-04-15",
  "hsd": "2026-07-15",
  "self_declaration_no": "...",
  "barcode": "8...",
  "confidence": {
    "name_vi": "high",
    "kcal": "high",
    "protein_g": "medium",
    "fat_sat_g": "low"
  }
}
```

**A per-field `confidence` is mandatory**: it forces the model to self-report uncertainty so that the Dart side can flag the field for review.

### 7.7.3. OCR ↔ VLM cross-check (critical against hallucinated numbers)

**Highest risk**: the VLM invents nutrition numbers. Research (TRM Labs, F22 Labs, Dataunboxed) all confirms: "VLM hallucinate, missing zero in dollar amount changes meaning entirely."

**Pipeline**:

1. ML Kit Text Recognition v2 → returns **every text token + bbox** (raw ground truth)
2. Gemma 4 → returns structured JSON
3. **Dart-side validator**:
   - For each numeric field in the JSON, **use a regex to find that exact string in the raw OCR text**
   - If it is NOT found → `confidence = "hallucinated"`, flag for review
   - If OCR shows a different number for the same nutrient (for example, OCR has "Năng lượng 245 kcal" but the VLM returns "kcal: 254") → flag as mismatch
4. **Server reconciliation**: data with ≥ 1 hallucinated/mismatch field is pushed to the admin queue

**Self-correction loop**: if ≥ 2 fields mismatch, the app automatically retries inference with the prompt: "You said energy=254 kcal but OCR sees '245 kcal'. Read the image again."
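
Step 3 of the pipeline (does the VLM's number appear in the OCR text?) can be sketched as below. The helper names are hypothetical. The sketch also makes two assumptions that go beyond the text: numbers are normalised so that the Vietnamese decimal comma ("6,5") matches a JSON value of 6.5, and labels are assumed not to use thousands separators.

```dart
/// Normalises a numeric string: decimal comma to dot, trailing zeros dropped
/// ("6,50" becomes "6.5", "28.0" becomes "28").
String normalizeNumber(String s) {
  final v = double.tryParse(s.replaceAll(',', '.'));
  if (v == null) return s;
  return v == v.roundToDouble() ? v.toInt().toString() : v.toString();
}

/// All standalone numbers found in the OCR text, normalised.
Set<String> numbersInOcr(String ocrText) {
  final pattern = RegExp(r'\d+(?:[.,]\d+)?');
  return pattern
      .allMatches(ocrText)
      .map((m) => normalizeNumber(m.group(0)!))
      .toSet();
}

/// Returns 'ok' when [value] appears as a whole number token in the OCR
/// text, otherwise 'hallucinated'.
String crossCheck(num value, String ocrText) =>
    numbersInOcr(ocrText).contains(normalizeNumber(value.toString()))
        ? 'ok'
        : 'hallucinated';
```

Matching whole number tokens prevents a value of 245 from being "found" inside 2450, which a plain substring search would accept. The sketch only covers the presence check. Detecting a mismatch for the same nutrient additionally requires associating each OCR number with its label, for example through the bounding boxes that ML Kit returns.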

### 7.7.4. Data gap: a 200-label Vietnamese pilot is mandatory

The PMC12387780 study (August 2025) compared GPT-4V, GPT-4o, and Gemini on 294 bilingual EN/AR food labels. It contains **no Gemma 4 mobile data point**, so this is a gap the PoC must close.

**Recommendation**: **Pilot 200 Vietnamese labels** (50% common domestic / 50% imported), measuring per-field accuracy and hallucination rate for Gemma 4 E2B vs E4B before go-live.

**Sources**: [Gemma 4 prompt formatting](https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4), [Bilingual food label LLM PMC12387780](https://pmc.ncbi.nlm.nih.gov/articles/PMC12387780/), [TRM Labs OCR→VLM](https://www.trmlabs.com/resources/blog/from-brittle-to-brilliant-why-we-replaced-ocr-with-vlms-for-image-extraction), [OCR vs VLM hybrid](https://dev.to/kesimo/ocr-vs-vlm-why-you-need-both-and-how-hybrid-approaches-win-5bo4).

---

## 7.8. Final proposed architecture

```
┌──────────────────────────────────────────────────────────┐
│ PRIMARY STACK (Plan A): DEFAULT                          │
├──────────────────────────────────────────────────────────┤
│ Camera → Quality Gate (Laplacian + ML Kit doc detect)    │
│   ↓                                                      │
│ Per-image Gemma 4 E2B inference (flutter_gemma)          │
│   • Token budget 560/image                               │
│   • Output 250 tokens (compact schema)                   │
│   • GPU backend by default, CPU fallback on failure      │
│   ↓                                                      │
│ Per-image OCR (ML Kit Text Recognition v2)               │
│   ↓                                                      │
│ Cross-check + Merge (Dart side)                          │
│   • Regex-match numeric fields: VLM JSON ↔ OCR text      │
│   • Set confidence = hallucinated on mismatch            │
│   ↓                                                      │
│ User confirm/edit → Submit (JSON text + original image)  │
│                                                          │
│ Auto-crop:                                               │
│   • iOS: VNDocumentCameraViewController                  │
│   • Android: ML Kit Document Scanner                     │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│ PLAN B: Server fallback (activated when needed)          │
├──────────────────────────────────────────────────────────┤
│ Trigger:                                                 │
│   • Tier B device + > 6 images/product                   │
│   • Thermal throttle detected                            │
│   • ≥ 3 hallucinated fields                              │
│   • flutter_gemma crashes 3 times in a row               │
│                                                          │
│ Server endpoint:                                         │
│   • Cloudflare Workers AI (Gemma 4 E4B/26B)              │
│   • or Gemini 2.5 Flash via Vertex AI                    │
│                                                          │
│ Cost guardrail: 3,300 submissions × ~5,000 tokens        │
│   ≈ 16.5M tokens/month → est. ~$20–50/month              │
│   (not yet in the Section 13 cost baseline; TBD in the   │
│   Sprint 2 PoC; if on-device fails > 10%, estimate TBD)  │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│ PLAN C: Pure cloud (panic mode)                          │
├──────────────────────────────────────────────────────────┤
│ If flutter_gemma hits a production blocker:              │
│ Tier A: Apple Foundation Models (text only) + ML Kit OCR │
│         → server VLM call cho understanding              │
│ Tier B: ML Kit OCR + server VLM 100%                     │
└──────────────────────────────────────────────────────────┘
```
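
The Plan B triggers in the diagram reduce to a single predicate. The sketch below is illustrative: `FallbackSignals` and its field names are hypothetical, and how each signal is detected (for example thermal state) is left to platform code.

```dart
/// Inputs to the fallback decision. All field names are hypothetical.
class FallbackSignals {
  final bool isTierB;
  final int imageCount;
  final bool thermalThrottled;
  final int hallucinatedFields;
  final int consecutiveCrashes;

  const FallbackSignals({
    this.isTierB = false,
    this.imageCount = 0,
    this.thermalThrottled = false,
    this.hallucinatedFields = 0,
    this.consecutiveCrashes = 0,
  });
}

/// True when any Plan B trigger listed in the diagram fires.
bool shouldUseServerFallback(FallbackSignals s) =>
    (s.isTierB && s.imageCount > 6) ||
    s.thermalThrottled ||
    s.hallucinatedFields >= 3 ||
    s.consecutiveCrashes >= 3;
```

Keeping the decision in one pure function makes the thresholds easy to unit-test and to adjust once the PoC produces real failure rates.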

### 7.8.1. Rationale

1. **Gemma 4 E2B + flutter_gemma**: currently the only on-device stack that offers cross-platform iOS + Android support, multimodality, JSON output, strong Vietnamese, latency within the 8 s budget at 250 output tokens, and RAM < 2 GB suitable for Tier B (latency and RAM on Tier B are still to be verified in the PoC)
2. **ML Kit OCR ground truth**: free, on-device, officially supports Vietnamese, and serves as the **anchor** against VLM hallucination
3. **Native scanner views**: few dependencies, low latency, and a UX familiar to users (like MS Lens / Adobe Scan)
4. **Lightweight server fallback**: at 3,300 submissions/month the volume is small, so server cost stays < $50/month even if 100% of traffic moves to the server

### 7.8.2. Risks & Mitigation

| # | Risk | Level | Mitigation |
|---|---|---|---|
| AI-R1 | flutter_gemma abandoned by its single maintainer | High | Keep an internal fork; prepare the native fallback code path (native MethodChannel + MediaPipe SDK, section 7.3.5); monitor commit cadence monthly |
| AI-R2 | Gemma 4 hallucinates nutrition numbers | High | ML Kit OCR cross-check; admin review queue; per-field confidence; self-correction loop |
| AI-R3 | Thermal throttling on Tier B devices | Medium | Light volume (~120 inferences/day, spread out) → low risk; automatic flag to fall back to CPU after N inferences |
| AI-R4 | iOS App Store rejection due to model size | Low | First-launch in-app download; pattern used by Lens/Adobe Scan |
| AI-R5 | Gemma 4 repetition loop with JSON | Medium | Drop the grammar constraint; Dart-side `jsonDecode` + retry cap of 2 |
| AI-R6 | Multi-image in one prompt not supported by flutter_gemma | Medium | Default to Option B (per-image), which does not depend on multi-image; verify in week 2 |
| AI-R7 | LiteRT-LM 0.10.0 fails on Pixel 8 GPU | Medium | Pin version 0.10.1+; mandatory Pixel 8 QA before go-live |

### 7.8.3. Mandatory pilots before go-live

| # | Item | Method | Pass criteria |
|---|---|---|---|
| **PoC-1** | Benchmark Gemma 4 E2B on 5 devices | Run 30 consecutive inferences on: iPhone 12 Pro, iPhone 15 Pro, Pixel 8, Galaxy S24, Galaxy A53. Record: TTFT, decode rate, RAM peak, thermal | TTFT ≤ 1s on GPU, RAM < 2GB, no crash |
| **PoC-2** | 200-label Vietnamese pilot | 100 common + 100 imported. Measure per-field accuracy, CER, hallucination rate | Per-field accuracy ≥ 90% on mandatory fields, hallucination ≤ 5% |
| **PoC-3** | A/B test of the multi-image options | Compare Option A (1 prompt, 4 images) vs Option B (4 prompts) | B is the default if A exceeds the Tier B budget |
| **PoC-4** | flutter_gemma stability | 8h continuous smoke test on iOS + Android | 0 crashes, 0 OOM |
| **PoC-5** | Quality gate latency | Measure the latency of the 5 dimensions on a 720p camera preview | Total < 500ms/frame, fps ≥ 2 |

---

## 7.9. One-line summary

OmiScan uses **Gemma 4 E2B via flutter_gemma as the primary VLM + ML Kit Text Recognition v2 as the OCR cross-check + native document scanners for auto-crop**, with a Plan B server fallback (Gemma 4 E4B/Gemini 2.5 Flash) for edge cases and Tier B thermal throttling. **Before go-live, a benchmark on the 5 target devices and a 200-label Vietnamese pilot are mandatory**.

---

## 7.10. Consolidated sources (deduplicated)

### Gemma 4 official

- [Google AI for Developers — Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4)
- [HuggingFace — Welcome Gemma 4](https://huggingface.co/blog/gemma4)
- [HuggingFace — Gemma 4 E2B litert-lm benchmark](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm)
- [HuggingFace — Gemma 4 E4B litert-lm benchmark](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm)
- [Google blog — Gemma 4 announcement](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/)
- [Google blog — Multi-token prediction Gemma 4](https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/)
- [Google AI — Gemma 4 prompt formatting](https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4)
- [Google DeepMind — Gemma 4](https://deepmind.google/models/gemma/gemma-4/)
- [Android Developers — Gemma 4 in AICore Developer Preview](https://android-developers.googleblog.com/2026/04/AI-Core-Developer-Preview.html)

### Flutter plugins

- [GitHub — DenisovAV/flutter_gemma](https://github.com/DenisovAV/flutter_gemma)
- [pub.dev — flutter_gemma](https://pub.dev/packages/flutter_gemma)
- [pub.dev — gemini_nano_android](https://pub.dev/packages/gemini_nano_android)
- [pub.dev — foundation_models_framework](https://pub.dev/packages/foundation_models_framework)
- [pub.dev — google_mlkit_text_recognition](https://pub.dev/packages/google_mlkit_text_recognition)
- [pub.dev — google_mlkit_document_scanner](https://pub.dev/packages/google_mlkit_document_scanner)

### OCR + Vision

- [Google ML Kit — Text Recognition v2 languages (Vietnamese supported)](https://developers.google.com/ml-kit/vision/text-recognition/v2/languages)
- [Google ML Kit — Document Scanner](https://developers.google.com/ml-kit/vision/doc-scanner)
- [PaddleOCR PP-OCRv5 Multilingual](https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.en.md)
- [HuggingFace — PP-OCRv5](https://huggingface.co/blog/baidu/ppocrv5)
- [Tesseract — Vietnamese traineddata](https://github.com/tesseract-ocr/tessdata/blob/main/vie.traineddata)
- [Apple Developer — VNRecognizeTextRequest](https://developer.apple.com/documentation/vision/vnrecognizetextrequest)
- [Apple Developer — VNDocumentCameraViewController](https://developer.apple.com/documentation/visionkit/vndocumentcameraviewcontroller)
- [Apple ML Research — FastVLM](https://machinelearning.apple.com/research/fast-vision-language-models)
- [Apple ML Research — Foundation Models 2025](https://machinelearning.apple.com/research/apple-foundation-models-2025-updates)

### Quality gate + benchmarks

- [PMC — Bilingual food label LLM extraction](https://pmc.ncbi.nlm.nih.gov/articles/PMC12387780/)
- [Roboflow — Read food labels with CV](https://blog.roboflow.com/read-food-labels-computer-vision/)
- [Mindee Nutrition Facts OCR API](https://www.mindee.com/blog/nutrition-facts-label-ocr-api-streamlining-food-label-compliance-and-data-management)
- [TRM Labs — OCR replaced by VLM](https://www.trmlabs.com/resources/blog/from-brittle-to-brilliant-why-we-replaced-ocr-with-vlms-for-image-extraction)
- [DEV.to — OCR vs VLM hybrid](https://dev.to/kesimo/ocr-vs-vlm-why-you-need-both-and-how-hybrid-approaches-win-5bo4)
- [PyImageSearch — Blur detection OpenCV](https://pyimagesearch.com/2015/09/07/blur-detection-with-opencv/)
- [Medium — Blur detection Dart Flutter OpenCV](https://medium.com/@myavuzokumus/blur-detection-on-an-image-using-dart-with-opencv-1cc98002827c)
- [AWS — Intelligent Document Processing with Textract + Bedrock](https://aws.amazon.com/blogs/machine-learning/intelligent-document-processing-with-amazon-textract-amazon-bedrock-and-langchain/)

### Known issues

- [LiteRT-LM Gemma 4 Pixel 8 issue #1850](https://github.com/google-ai-edge/LiteRT-LM/issues/1850)
- [vLLM Gemma 4 JSON repetition #40080](https://github.com/vllm-project/vllm/issues/40080)