author: ELPA Syncer <elpasync@gnu.org> 2024-05-28 18:58:00 -0400
committer: ELPA Syncer <elpasync@gnu.org> 2024-05-28 18:58:00 -0400
commit: 2e91739eda16221df269ef029d86e454a9ac1fe2
tree: d4aae4dd724f898d83d1a779bcbb80bc16e8251f
parent: 2852aa6170d1fad2b6d2e3226cfcfa5067eb9b35
parent: a17203d26135b970e4d7c5d101955d41303a758f
branch: externals/guess-language

Merge remote-tracking branch 'refs/remotes/upstream/guess-language/main' into elpa--merge/guess-language

-rw-r--r--  README.org                            |  80
-rw-r--r--  guess-language.el                     | 119
-rw-r--r--  testdata/all_supported_languages.org  |   4
-rw-r--r--  trigrams/eo                           | 300
-rw-r--r--  trigrams/sr_LAT                       | 299
-rw-r--r--  trigrams/vi                           | 300
6 files changed, 1032 insertions, 70 deletions

diff --git a/README.org b/README.org
index 9ea20b7..aadf031 100644
--- a/README.org
+++ b/README.org
@@ -12,11 +12,16 @@ Emacs minor mode that detects the language of what you're typing. Automatically
- Stays out of your way. Set up once, then forget it exists.
- Works with documents written in multiple languages.
------
+*News:*
+- [2022-04-08 Fri] Added support for displaying current language as color emoji in the mode line (requires Emacs 28.1 or later). See updated format of ~guess-language-langcodes~. Old configurations should still work but won’t show color flags.
+
+* What is guess-language?
I write a lot of text in multiple languages and was getting tired of constantly having to switch the dictionary of my spell-checker. In true Emacs spirit, I decided to dust off my grandpa's parentheses and wrote some code to address this problem. The result is ~guess-language-mode~, a minor mode for Emacs that guesses the language of the current paragraph and then changes the dictionary of ispell and the language settings of typo-mode (if present). It also reruns Flyspell on the current paragraph, but only on that paragraph because I want to leave paragraphs in other languages untouched. Language guessing is triggered when Flyspell detects an unknown word, but only if the paragraph has enough material to allow for robust detection of the language (~ 35 characters).
-Currently, the following languages are supported: Arabic, Czech, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Portuguese, Russian, Slovak, Slovenian, Spanish, Swedish. It is very easy to add more languages and this repository includes the necessary language statistics for 49 additional languages. (These were copied from [[https://github.com/kent37/guess-language][guess_language.py]].)
+Currently, the following languages are supported: Arabic, Czech, Danish, Dutch, English, Esperanto, Finnish, French, German, Italian, Norwegian, Polish, Portuguese, Russian, Serbian (Cyrillic and Latin), Slovak, Slovenian, Spanish, Swedish, Vietnamese.
+
+It is easy to add more languages and this repository includes the necessary language statistics for 47 additional languages. (These were copied from [[https://github.com/kent37/guess-language][guess_language.py]].) If we already have the required language data (see directory [[https://github.com/tmalsburg/guess-language.el/tree/master/trigrams][trigrams]]), all you need to do is to add an entry to the variable ~guess-language-langcodes~. See [[https://github.com/tmalsburg/guess-language.el/commit/bbafdeaf380c41e4546510df7c257b898b702d65][here]] for the commit that added support for Serbian. See the code of [[https://github.com/jorgenschaefer/typoel][typo-mode]] to determine the quoting style needed for the language that you’re adding. An overview of quoting styles across languages can be found on [[https://en.wikipedia.org/wiki/Quotation_mark][Wikipedia]]. PRs adding new languages are welcome.
** Prerequisites
@@ -52,11 +57,11 @@ In this case, use the variable ~guess-language-langcodes~ to tell guess-language
#+BEGIN_SRC elisp
(setq guess-language-langcodes
- '((en . ("en_GB" "English"))
- (de . ("de_CH" "German"))))
+ '((en . ("en_GB" "English" "🇬🇧" "English"))
+ (de . ("de_CH" "German" "🇨🇭" "Swiss German"))))
#+END_SRC
-The key of each entry in this alist is an ISO 639-1 language code. The first element of the value is the name of the dictionary that should be used (i.e., what you would enter when setting the language via ~M-x ispell-change-dictionary~). The second element is the name of the language setting that should be used with typo-mode (if present). If a language is not supported by typo-mode or if you are not using typo-mode, enter ~nil~.
+The key is a symbol specifying the ISO 639-1 code of the language. The values is a list with four elements. The first is the name of the dictionary that should be used by the spell-checker (e.g., what you would enter when setting the language with ~ispell-change-dictionary~). The second element is the name of the language setting that should be used with typo-mode. If a language is not supported by typo-mode, that value is ~nil~. The third element is a string for displaying the current language in the mode line. This could be text or a Unicode flag symbol (displayed as color emoji starting from Emacs 28.1). The last element is the name of the language for display in the mini buffer.
For a list of all dictionaries available for spell-checking, use the following:
@@ -66,25 +71,29 @@ For a list of all dictionaries available for spell-checking, use the following:
Languages that are currently supported by guess-language-mode:
-| Language | IDO 639-1 code | Default Ispell dictionary | Default typo-mode setting |
-|------------+----------------+---------------------------+---------------------------|
-| Arabic | ~ar~ | ar | |
-| Czech | ~cs~ | czech | Czech |
-| Danish | ~da~ | dansk | |
-| Dutch | ~nl~ | nederlands | |
-| English | ~en~ | en | English |
-| Finnish | ~fi~ | finnish | Finnish |
-| French | ~fr~ | francais | French |
-| German | ~de~ | de | German |
-| Italian | ~it~ | italiano | Italian |
-| Norwegian | ~nb~ | norsk | |
-| Polish | ~pl~ | polish | |
-| Portuguese | ~pt~ | portuguese | |
-| Russian | ~ru~ | russian | Russian |
-| Slovak | ~sk~ | slovak | |
-| Slovenian | ~sl~ | slovenian | |
-| Spanish | ~es~ | spanish | |
-| Swedish | ~sv~ | svenska | |
+| Language | IDO 639-1 code | Default Ispell dictionary | Default typo-mode setting |
+|--------------------+----------------+---------------------------+----------------------------------|
+| Arabic | ~ar~ | ar | |
+| Czech | ~cs~ | czech | Czech |
+| Danish | ~da~ | dansk | |
+| Dutch | ~nl~ | nederlands | |
+| English | ~en~ | en | English |
+| Esperanto | ~eo~ | esperanto | English |
+| Finnish | ~fi~ | finnish | Finnish |
+| French | ~fr~ | francais | French |
+| German | ~de~ | de | German |
+| Italian | ~it~ | italiano | Italian |
+| Norwegian | ~nb~ | norsk | |
+| Polish | ~pl~ | polish | |
+| Portuguese | ~pt~ | portuguese | |
+| Russian | ~ru~ | russian | Russian |
+| Serbian (Cyrillic) | ~sr~ | serbian | German (most similar to Serbian) |
+| Serbian (Latin) | ~sr~ | sr-lat | German (most similar to Serbian) |
+| Slovak | ~sk~ | slovak | |
+| Slovenian | ~sl~ | slovenian | |
+| Spanish | ~es~ | spanish | |
+| Swedish | ~sv~ | svenska | |
+| Vietnamese | ~vi~ | viet | |
*** Custom functions to be run when a new language is detected
@@ -131,6 +140,27 @@ which LANG was detected but these are ignored."
(add-hook 'guess-language-after-detection-functions #'guess-language-switch-festival-function)
#+END_SRC
-The ~pcase~ needs to be modified to use the voiced that are installed on your system. Refer to the documentation of Festival for details.
+The ~pcase~ needs to be modified to use the voices that are installed on your system. Refer to the documentation of Festival for details.
+
+*** Changing the language of Synosaurus
+
+[[https://github.com/hpdeifel/synosaurus][Synosaurus]] is an Emacs package providing access to a German or English thesaurus. Using the code below the language of the thesaurus is automatically changed to the language of the current paragraph. Refer to the documentation of Synosaurus for details.
+
+#+BEGIN_SRC elisp
+(defun guess-language-switch-synosaurus (lang beginning end)
+ "Switch the thesaurus language.
+
+LANG is the ISO 639-1 code of the language (as a
+symbol). BEGINNING and END are the endpoints of the region in
+which LANG was detected. These are ignored."
+ (when (featurep 'synosaurus)
+ (pcase lang
+ ('en (setq synosaurus-backend 'synosaurus-backend-wordnet))
+ ('de (setq synosaurus-backend 'synosaurus-backend-openthesaurus)))))
+
+(add-hook 'guess-language-after-detection-functions #'guess-language-switch-synosaurus)
+#+END_SRC
+** Notes
+- Support for Latin Serbian is based on trigrams transliterated from Cyrillic Serbian. Since some Cyrillic trigrams transliterate to 4-grams in Latin, we truncated those but as a result have two duplicates, "e n" and "ra ". Not ideal but the results are probably still robust enough. Nonetheless, it would be good if someone could compute proper Latin trigrams one day.
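
Reviewer sketch (not part of the commit): the README hunk above describes the new four-element value format of `guess-language-langcodes` and says old two-element configurations should keep working. The snippet below is a minimal illustration of how such an entry could be read with a fallback, modeled on the `:lighter` expression changed in guess-language.el. The names `my/langcodes` and `my/lighter-for` are hypothetical and exist only for this example.

```elisp
;; Illustrative sketch only; `my/langcodes' and `my/lighter-for' are
;; hypothetical names, not part of guess-language.el.
(defvar my/langcodes
  '((en . ("en_GB" "English" "🇬🇧" "English"))   ; new four-element entry
    (de . ("de_CH" "German")))                    ; old two-element entry
  "Example alist in the format of `guess-language-langcodes'.")

(defun my/lighter-for (lang)
  "Return a mode-line string for LANG, falling back for old entries."
  (let ((entry (assq lang my/langcodes)))
    ;; ENTRY is (LANG DICT TYPO FLAG NAME), so index 0 is the key itself.
    (format " %s" (or (nth 3 entry)   ; flag or text for the mode line
                      (nth 2 entry)   ; typo-mode language name
                      (nth 1 entry)   ; dictionary name
                      "default"))))   ; no entry at all

(my/lighter-for 'en)
(my/lighter-for 'de)
(my/lighter-for 'fr)
```

Because `nth` returns nil past the end of a list (and for a missing entry), the `or` chain degrades from flag to typo-mode name to dictionary name without any explicit length check.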
diff --git a/guess-language.el b/guess-language.el
index bb30fda..08ddb23 100644
--- a/guess-language.el
+++ b/guess-language.el
@@ -4,6 +4,8 @@
;; Author: Titus von der Malsburg <malsburg@posteo.de>
;; Maintainer: Titus von der Malsburg <malsburg@posteo.de>
+;; Description: Robust automatic language detection
+;; Keywords: wp
;; Version: 0.0.1
;; Package-Requires: ((cl-lib "0.5") (emacs "24") (nadvice "0.1"))
;; URL: https://github.com/tmalsburg/guess-language.el
@@ -38,12 +40,12 @@
;; such that users can do things like changing the input method when
;; needed.
;;
-;; The detection algorithm is based on counts of character
-;; trigrams. At this time, supported languages are Arabic, Czech,
-;; Danish, Dutch, English, Finnish, French, German, Italian,
-;; Norwegian, Polish, Portuguese, Russian, Slovak, Slovenian, Spanish,
-;; Swedish. Adding further languages is very easy and this package
-;; already contains language statistics for 49 additional languages.
+;; The detection algorithm is based on counts of character trigrams. At this
+;; time, supported languages are Arabic, Czech, Danish, Dutch, English,
+;; Esperanto, Finnish, French, German, Italian, Norwegian, Polish, Portuguese,
+;; Russian, Serbian, Slovak, Slovenian, Spanish, Swedish and Vietnamese. Adding
+;; further languages is very easy and this package already contains language
+;; statistics for 49 additional languages.
;;; Code:
@@ -61,11 +63,12 @@ dictionary, input methods, etc."
(defcustom guess-language-languages '(en de fr)
"List of languages that should be considered.
-Uses ISO 639-1 identifiers. Currently supported languages are:
-Arabic (ar), Czech (cs), Danish (da), Dutch (nl), English (en),
-Finnish (fi), French (fr), German (de), Italian (it),
-Norwegian (nb), Polish (pl), Portuguese (pt), Russian (ru),
-Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv)"
+Uses ISO 639-1 identifiers. Currently supported languages are:
+Arabic (ar), Czech (cs), Danish (da), Dutch (nl), English (en),
+Esperanto (eo), Finnish (fi), French (fr), German (de),
+Italian (it), Norwegian (nb), Polish (pl), Portuguese (pt),
+Russian (ru), Slovak (sk), Slovenian (sl), Spanish (es),
+Swedish (sv) and Vietnamese (vi)."
:type '(repeat symbol))
(defcustom guess-language-min-paragraph-length 40
@@ -80,33 +83,41 @@ little material to reliably guess the language."
"The regular expressions that are used to count trigrams.")
(defcustom guess-language-langcodes
- '((ar . ("ar" nil))
- (cs . ("czech" "Czech"))
- (da . ("dansk" nil))
- (de . ("de" "German"))
- (en . ("en" "English"))
- (es . ("spanish" nil))
- (fi . ("finnish" "Finnish"))
- (fr . ("francais" "French"))
- (it . ("italiano" "Italian"))
- (nb . ("norsk" nil))
- (nl . ("nederlands" nil))
- (pl . ("polish" nil))
- (pt . ("portuguese" nil))
- (ru . ("russian" "Russian"))
- (sk . ("slovak" nil))
- (sl . ("slovenian" nil))
- (sv . ("svenska" nil)))
+ '((ar . ("ar" nil "اَلْعَرَبِيَّةُ" "Arabic"))
+ (cs . ("czech" "Czech" "🇨🇿" "Czech"))
+ (da . ("dansk" nil "🇩🇰" "Danish"))
+ (de . ("de" "German" "🇩🇪" "German"))
+ (en . ("en" "English" "🇬🇧" "English"))
+ (eo . ("eo" "Esperanto" "🟩" "Esperanto"))
+ (es . ("spanish" nil "🇪🇸" "Spanish"))
+ (fi . ("finnish" "Finnish" "🇫🇮" "Finnish"))
+ (fr . ("francais" "French" "🇫🇷" "French"))
+ (it . ("italiano" "Italian" "🇮🇹" "Italian"))
+ (nb . ("norsk" nil "🇳🇴" "Norsk"))
+ (nl . ("nederlands" nil "🇳🇱" "Dutch"))
+ (pl . ("polish" "Polish" "🇵🇱" "Polish"))
+ (pt . ("portuguese" nil "🇵🇹" "Portuguese"))
+ (ru . ("russian" "Russian" "🇷🇺" "Russian"))
+ (sk . ("slovak" nil "🇸🇰" "Slovak"))
+ (sl . ("slovenian" nil "🇸🇮" "Slovenian"))
+ (sr . ("serbian" "Serbian" "🇷🇸" "Serbian"))
+ (sr_LAT . ("sr-lat" "Serbian" "🇷🇸" "Serbian"))
+ (sv . ("svenska" "Swedish" "🇸🇪" "Swedish"))
+ (vi . ("viet" nil "🇻🇳" "Vietnamese")))
"Language codes for spell-checker and typo-mode.
The key is a symbol specifying the ISO 639-1 code of the
-language. The values is a list with two elements. The first is
+language. The values is a list with four elements. The first is
the name of the dictionary that should be used by the
spell-checker (e.g., what you would enter when setting the
language with `ispell-change-dictionary'). The second element is
the name of the language setting that should be used with
typo-mode. If a language is not supported by typo-mode, that
-value is nil."
+value is nil. The third element is a string for displaying the
+current language in the mode line. This could be text or a
+Unicode flag symbol (displayed as color emoji starting from Emacs
+28.1). The last element is the name of the language for display
+in the mini buffer."
:type '(alist :key-type symbol :value-type list))
(defcustom guess-language-after-detection-functions (list #'guess-language-switch-flyspell-function
@@ -132,6 +143,9 @@ By default it's the same directory where this module is installed."
Uses ISO 639-1 to identify languages.")
(make-variable-buffer-local 'guess-language-current-language)
+(defvar-local guess-language--post-command-h #'ignore
+ "Function called by `guess-language--post-command-h'.")
+
(defun guess-language-load-trigrams ()
"Load language statistics."
(cl-loop
@@ -148,9 +162,8 @@ Uses ISO 639-1 to identify languages.")
"Compile regular expressions used for guessing language."
(setq guess-language--regexps
(cl-loop
- for lang in (guess-language-load-trigrams)
- for regexp = (mapconcat 'identity (cdr lang) "\\|")
- collect (cons (car lang) regexp))))
+ for (lang . regexps) in (guess-language-load-trigrams)
+ collect (cons lang (regexp-opt regexps)))))
(defun guess-language-backward-paragraph ()
"Uses whatever method for moving to the previous paragraph is
@@ -183,9 +196,8 @@ Region starts at BEGINNING and ends at END."
(when (cl-set-exclusive-or guess-language-languages (mapcar #'car guess-language--regexps))
(guess-language-compile-regexps))
(let ((tally (cl-loop
- for lang in guess-language--regexps
- for regexp = (cdr lang)
- collect (cons (car lang) (how-many regexp beginning end)))))
+ for (lang . regexp) in guess-language--regexps
+ collect (cons lang (how-many regexp beginning end)))))
(car (cl-reduce (lambda (x y) (if (> (cdr x) (cdr y)) x y)) tally))))
(defun guess-language-buffer ()
@@ -218,7 +230,7 @@ things like changing the keyboard layout or input method."
(let ((lang (guess-language-region beginning end)))
(run-hook-with-args 'guess-language-after-detection-functions lang beginning end)
(setq guess-language-current-language lang)
- (message (format "Detected language: %s" (caddr (assoc lang guess-language-langcodes))))))))
+ (message (format "Detected language: %s" (nth 4 (assoc lang guess-language-langcodes))))))))
(defun guess-language-function (_beginning _end _doublon)
"Wrapper for `guess-language' because `flyspell-incorrect-hook'
@@ -228,6 +240,13 @@ provides three arguments that we don't need."
;; words:
nil)
+(defun guess-language--post-command-h ()
+ "The `post-command-hook' used by guess-language.
+
+Used by `guess-language-switch-flyspell-function' to recheck the
+spelling of the current paragraph after switching dictionary."
+ (funcall guess-language--post-command-h))
+
(defun guess-language-switch-flyspell-function (lang beginning end)
"Switch the Flyspell dictionary and recheck the current paragraph.
@@ -245,13 +264,15 @@ which LANG was detected."
;; from flyspell-incorrect-hook that called us. Otherwise, the
;; word at point is highlighted as incorrect even if it is
;; correct according to the new dictionary.
- (run-at-time 0 nil
- (lambda ()
- (let ((flyspell-issue-welcome-flag nil)
- (flyspell-issue-message-flag nil)
- (flyspell-incorrect-hook nil)
- (flyspell-large-region 1))
- (flyspell-region beginning end)))))))
+ (setq guess-language--post-command-h
+ (lambda ()
+ (setq guess-language--post-command-h #'ignore)
+ (let ((flyspell-issue-welcome-flag nil)
+ (flyspell-issue-message-flag nil)
+ (flyspell-incorrect-hook nil)
+ (flyspell-large-region 1))
+ (with-local-quit
+ (flyspell-region beginning end))))))))
(defun guess-language-switch-typo-mode-function (lang _beginning _end)
"Switch the language used by typo-mode.
@@ -289,13 +310,21 @@ correctly."
;; The initial value.
:init-value nil
;; The indicator for the mode line.
- :lighter (:eval (format " (%s)" (or guess-language-current-language "default")))
+ :lighter (:eval (format " %s" (or
+ (nth 3 (assq guess-language-current-language guess-language-langcodes))
+ ;; Options for users of old configurations:
+ (nth 2 (assq guess-language-current-language guess-language-langcodes))
+ (nth 1 (assq guess-language-current-language guess-language-langcodes))
+ "default")))
:global nil
(if guess-language-mode
(progn
(add-hook 'flyspell-incorrect-hook #'guess-language-function nil t)
+ ;; Depth of 92 to ensure placement after flyspell's PCH
+ (add-hook 'post-command-hook #'guess-language--post-command-h 92 t)
(advice-add 'flyspell-buffer :around #'guess-language-flyspell-buffer-wrapper))
(remove-hook 'flyspell-incorrect-hook #'guess-language-function t)
+ (remove-hook 'post-command-hook #'guess-language--post-command-h t)
(advice-remove 'flyspell-buffer #'guess-language-flyspell-buffer-wrapper)))
(defun guess-language-mark-lines (&optional highlight)
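
Reviewer sketch (not part of the commit): the hunks above switch regexp compilation to `regexp-opt` and tally matches per language with `how-many`. The following is a small self-contained illustration of that counting idea, assuming a toy trigram table. The `my/` names and the three-trigram lists are invented for this example and are not the shipped language data, and ties are resolved differently than in `guess-language-region`.

```elisp
;; Illustrative sketch only; the `my/' names and the tiny trigram lists
;; are made up for this example.
(defvar my/trigrams
  '((en . ("the" "ing" "and"))
    (de . ("der" "ein" "sch")))
  "Toy alist mapping language symbols to lists of trigrams.")

(defun my/guess (text)
  "Return the language in `my/trigrams' whose trigrams match TEXT most often."
  (with-temp-buffer
    (insert text)
    (let ((best nil)
          (best-count -1))
      (dolist (entry my/trigrams best)
        ;; Build one optimized regexp per language and count its
        ;; non-overlapping matches in the whole buffer.
        (let ((count (how-many (regexp-opt (cdr entry))
                               (point-min) (point-max))))
          (when (> count best-count)
            (setq best (car entry)
                  best-count count)))))))

(my/guess "the king and the singing thing")
(my/guess "der Schein einer schönen Scheibe")
```

Unlike joining the trigrams with `\\|`, `regexp-opt` also quotes regexp-special characters, which matters for trigrams that contain characters such as `.` (several entries in trigrams/vi do).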
diff --git a/testdata/all_supported_languages.org b/testdata/all_supported_languages.org
index f5ed965..1ea8858 100644
--- a/testdata/all_supported_languages.org
+++ b/testdata/all_supported_languages.org
@@ -9,6 +9,8 @@ de: Dies ist ein kurzer Text zu Testzwecken geschrieben und übersetzt in mehrer
en: This is a short text written for testing purposes and translated to several languages using Google Translate.
+eo: Ĉi tiu estas mallonga teksto skribita por elprov celoj kaj tradukitajn kelkajn lingvojn uzantan Google Traduki.
+
es: Este es un texto corto escrito para propósitos de prueba y traducido a varios idiomas usando Google Translate.
fi: Tämä on lyhyt teksti kirjoitettu testausta varten ja käännetty useita kieliä Google kääntää.
@@ -32,3 +34,5 @@ sk: Jedná sa o krátky text písaný na účely testovania a preložené do nie
sl: To je kratko besedilo, napisano za testiranje in prevedena v več jezikov, ki uporabljajo Google Translate.
sv: Detta är en kort text skriven för teständamål och översatt till flera språk med Google Translate.
+
+vi: Đây là một văn bản ngắn được viết cho mục đích thử nghiệm và được dịch sang một số ngôn ngữ bằng Google Dịch.
diff --git a/trigrams/eo b/trigrams/eo
new file mode 100644
index 0000000..266134a
--- /dev/null
+++ b/trigrams/eo
@@ -0,0 +1,300 @@
+ la
+la
+ de
+de
+aj
+oj
+as
+is
+en
+ en
+ ka
+est
+o d
+ es
+kaj
+e l
+to
+sta
+o e
+io
+o k
+on
+ ko
+ro
+ta
+tas
+ al
+a k
+ pr
+n l
+a a
+ po
+ ki
+ ma
+o l
+jn
+ant
+ li
+a p
+ist
+s l
+nto
+sti
+j k
+no
+ita
+tis
+do
+an
+ent
+ re
+aŭ
+j e
+kon
+li
+toj
+ran
+n k
+ ti
+s e
+el
+al
+a s
+ in
+ter
+aro
+ an
+a m
+a e
+ia
+n d
+ojn
+per
+ s
+j d
+ se
+nta
+str
+sto
+a l
+ pl
+mo
+a d
+ ĝi
+ si
+ tr
+and
+s k
+o p
+lo
+j l
+tra
+par
+ pa
+unu
+pro
+ono
+o a
+nte
+j p
+ no
+ ku
+te
+mal
+taj
+ el
+kom
+iu
+art
+roj
+ ja
+ĝis
+ mo
+lan
+ra
+a r
+s a
+ vi
+era
+tro
+gra
+er
+e k
+ori
+n e
+ di
+ata
+int
+s p
+o s
+a f
+ko
+a t
+j a
+n p
+ ek
+kiu
+na
+ne
+ pe
+e e
+e d
+da
+ili
+l l
+ado
+ank
+ver
+por
+men
+e a
+ ne
+man
+ me
+ du
+un
+ un
+ato
+kun
+mon
+ali
+ste
+ajn
+dis
+tri
+rio
+j s
+ lo
+ara
+pre
+ te
+ gr
+oni
+kie
+nom
+jar
+nda
+i e
+ĝi
+noj
+kto
+ero
+n s
+igi
+cio
+e s
+a v
+a n
+or
+pri
+e p
+ fo
+ ĉe
+iĝi
+s s
+n a
+ ha
+eri
+ ar
+ndo
+a u
+ont
+ano
+lia
+iel
+ost
+ris
+ fa
+ort
+iko
+lin
+ari
+ ĉi
+ri
+iaj
+ion
+mun
+ ve
+ino
+tor
+ sa
+loj
+co
+nis
+ton
+ aŭ
+e m
+ona
+rto
+aci
+spe
+ala
+ple
+for
+o t
+vas
+olo
+tiu
+jo
+pos
+kaŭ
+re
+j m
+nio
+ fi
+ st
+o m
+ ba
+tan
+a j
+ekt
+ ge
+ons
+s m
+omo
+ing
+ mi
+omu
+a b
+a i
+ten
+enc
+res
+ika
+rbo
+vis
+nka
+pli
+ a
+ mu
+iuj
+tem
+hav
+ kr
+ na
+ila
+alo
+ ke
+aĵo
+umo
+i l
+ani
+ova
+num
+r l
+urb
+ron
+ ap
+am
+tat
+tur
+cia
+ ri
+ovi
+ava
+ntr
+ or
+ejo
+nst
+ka
diff --git a/trigrams/sr_LAT b/trigrams/sr_LAT
new file mode 100644
index 0000000..e478ae8
--- /dev/null
+++ b/trigrams/sr_LAT
@@ -0,0 +1,299 @@
+ na
+ je
+ po
+je
+ i
+ ne
+ pr
+ra
+ cv
+og
+a s
+ih
+na
+koj
+oga
+ u
+a p
+ne
+ni
+ti
+ da
+om
+ ve
+ sr
+i s
+sko
+ ob
+a n
+da
+e n
+no
+nog
+o j
+oj
+ za
+va
+e s
+i p
+ma
+nik
+obr
+ova
+ ko
+a i
+dij
+e n
+ka
+ko
+kog
+ost
+sve
+stv
+sti
+tra
+edi
+ima
+pok
+pra
+raz
+te
+ bo
+ vi
+ sa
+avo
+bra
+gos
+e i
+eli
+eni
+za
+iki
+io
+pre
+rav
+rad
+u s
+ju
+nja
+ bi
+ do
+ st
+ast
+boj
+ebo
+i n
+im
+ku
+lan
+neb
+ovo
+ogo
+osl
+ojš
+ped
+str
+čas
+ go
+ kr
+ mo
+ čl
+a m
+a o
+ako
+ača
+vel
+vet
+vog
+eda
+ist
+iti
+ije
+oko
+slo
+srb
+čla
+ be
+ os
+ ot
+ re
+ se
+a v
+an
+bog
+bro
+ven
+gra
+e o
+ika
+ija
+kih
+kom
+li
+nu
+ota
+ojn
+pod
+rbs
+red
+roj
+sa
+sni
+tač
+tva
+ja
+ji
+ ka
+ ov
+ tr
+a j
+avi
+az
+ano
+bio
+vik
+vo
+gov
+dni
+e č
+ego
+i o
+iva
+ivo
+ik
+ine
+ini
+ipe
+kip
+lik
+lo
+naš
+nos
+o t
+od
+odi
+ona
+oji
+poč
+pro
+ra
+ris
+rod
+rst
+se
+spo
+sta
+tić
+u d
+u n
+u o
+čin
+ša
+jed
+jni
+će
+ m
+ me
+ ni
+ on
+ pa
+ sl
+ te
+a u
+ava
+ave
+avn
+ana
+ao
+ati
+aci
+aju
+anj
+bsk
+vor
+vos
+vsk
+din
+e u
+edn
+ezi
+eka
+eno
+eto
+enj
+živ
+i g
+i i
+i k
+i t
+iku
+ičk
+ki
+krs
+la
+lav
+lit
+me
+men
+nac
+o n
+o p
+o u
+odn
+oli
+orn
+osn
+osp
+oče
+psk
+reč
+rps
+svo
+ski
+sla
+srp
+su
+ta
+tav
+tve
+u b
+jez
+ći
+ en
+ ži
+ im
+ mu
+ od
+ su
+ ta
+ hr
+ ča
+ št
+ nj
+a d
+a z
+a k
+a t
+adu
+alo
+ani
+aso
+van
+vač
+ved
+vi
+vno
+vot
+voj
+vu
+dob
+dru
+dse
+du
+e b
+e d
+e m
+em
+ema
+ent
+enc
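
Reviewer sketch (not part of the commit): the README note in this commit says the transliterated Latin Serbian trigrams contain two duplicates, "e n" and "ra ". A check along the following lines could be used to spot such repeats in a list of trigram strings. The function name `my/duplicates` is hypothetical, and the input list is a hand-picked excerpt rather than the full file.

```elisp
;; Illustrative sketch only; `my/duplicates' is a hypothetical helper.
(defun my/duplicates (items)
  "Return the strings in ITEMS that occur more than once, in first-seen order."
  (let ((seen (make-hash-table :test #'equal))
        (dups '()))
    (dolist (item items (nreverse dups))
      (if (gethash item seen)
          ;; Already seen: record it once as a duplicate.
          (unless (member item dups)
            (push item dups))
        (puthash item t seen)))))

(my/duplicates '(" na" "e n" "ra " "a s" "e n" "ra "))
```

Applied to the excerpt, this reports the same two trigrams that the README note mentions.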
diff --git a/trigrams/vi b/trigrams/vi
new file mode 100644
index 0000000..e135c6a
--- /dev/null
+++ b/trigrams/vi
@@ -0,0 +1,300 @@
+ng
+ th
+nh
+ ch
+ tr
+hể
+ lo
+ nh
+oại
+loạ
+ể l
+Thể
+ại:
+ và
+ ph
+ ng
+ kh
+n t
+g t
+ củ
+ên
+ủa
+của
+ông
+và
+ là
+ qu
+ác
+ gi
+g c
+ đư
+ân
+ cá
+n c
+là
+ược
+ợc
+i t
+ong
+đượ
+các
+c t
+ại
+ới
+ột
+ có
+i c
+n đ
+có
+ăm
+ron
+ời
+ nă
+iên
+ến
+ mộ
+ành
+g n
+một
+hàn
+g đ
+ch
+ ti
+ Th
+năm
+n n
+ện
+tro
+ất
+t t
+ày
+h t
+c c
+ộc
+i n
+thu
+ài
+u t
+hi
+ình
+inh
+ười
+gườ
+chi
+uộc
+ đã
+iến
+đã
+i đ
+an
+g v
+hiệ
+ngư
+hôn
+n v
+hiế
+ ''
+ nà
+ang
+iện
+ết
+ươn
+ơng
+n l
+ào
+cho
+ho
+với
+ vớ
+t c
+t đ
+m 1
+g l
+h c
+ Tr
+c đ
+ độ
+ần
+ tạ
+ều
+này
+ay
+àng
+ 19
+n,
+ vi
+ số
+i v
+ ở
+ườn
+a c
+ật
+ung
+ ba
+ai
+nhi
+ùng
+ính
+ận
+iều
+ờng
+, t
+a t
+ Ch
+c n
+ước
+áng
+huộ
+. T
+uân
+ầu
+ida
+anh
+như
+khô
+thà
+n s
+ững
+ộng
+hán
+ bi
+n b
+dae
+hữn
+ống
+ họ
+ đi
+ải
+phá
+hân
+ực
+, c
+ia
+n h
+huy
+ Ph
+gia
+am
+g h
+vào
+ớc
+c l
+ từ
+y t
+ hi
+ao
+, n
+ây
+iệt
+ền
+g b
+ra
+àn
+thá
+ản
+ngh
+ về
+ức
+ệt
+ập
+khi
+u c
+oàn
+về
+nhữ
+m t
+à c
+ốc
+ cô
+o t
+ đó
+ Nh
+au
+i l
+ ho
+ sa
+số
+át
+eTh
+t n
+g,
+ đế
+ Ng
+quâ
+à t
+h đ
+g k
+ng,
+qua
+từ
+''
+c h
+trư
+ội
+c v
+hư
+ hà
+ ra
+on
+the
+h v
+i:T
+a đ
+, v
+uốc
+hín
+a n
+hiê
+chí
+ối
+m c
+thể
+i h
+đến
+tiế
+ đầ
+o c
+oài
+ái
+ Qu
+iết
+ó t
+côn
+eo
+tra
+. N
+ Na
+ách
+ ôn
+g m
+u đ
+.Th
+o n
+tại
+g s
+ bị
+heo
+gày
+aeT
+ 20
+y c
+c b
+bị
+hất
+t h
+i,
+ lạ
+hế
+đầu
+ dâ
+hủ
+ ha
+ran
+ánh
+ọc
+để
+ để
+ ki
+ vậ
+ng.