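A corpus assembled from the Wikipedia articles of basic vocabulary words, for learning both the language and encyclopedic knowledge.
Only the English edition has been released so far.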
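Background
"Vocabulary size" usually refers to the number of word stems. Some studies suggest that a reader must recognize 98% of the words in a text to basically understand its sentences. Yet counting shows that even 100,000 words (surface forms, not stems) fall short of covering 98% of English. AI models usually get by with a vocabulary of 2^15 = 32768 entries. This project therefore tries to have humans learn the way AI models are trained.
Select a basic vocabulary, build an encyclopedia corpus, and learn the language and encyclopedic knowledge from it.
In this document, a "word" is a token produced by UnicodeTokenizer. Normalization is essentially lowercasing and does not include stemming.
Definition: word level = -lg(word frequency), where lg is the base-10 logarithm, so a word with relative frequency 10^-5 has level 5.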
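The level definition above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own script: the function name `word_levels` is hypothetical, and whitespace splitting plus lowercasing stands in for UnicodeTokenizer.

```python
import math
from collections import Counter


def word_levels(tokens):
    """Map each distinct token to level = -log10(relative frequency)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: -math.log10(count / total) for word, count in counts.items()}


# Toy corpus: lowercased and split on whitespace (an assumption; the
# project itself tokenizes with UnicodeTokenizer).
tokens = "the cat saw the dog and the bird".lower().split()
levels = word_levels(tokens)

for word, level in sorted(levels.items(), key=lambda item: item[1]):
    print(f"{word}\t{level:.2f}")
```

Frequent words get low levels and rare words get high levels, so sorting by level ascending is the same as sorting by frequency descending.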
Corpus
Wikipedia (https://en.wikipedia.org/wiki/Ai)
BookCorpus (https://github.com/soskek/bookcorpus)
arXiv abstracts (https://www.kaggle.com/datasets/Cornell-University/arxiv)
Dictionary: ECDICT (https://github.com/skywind3000/ECDICT)
Dependencies
Wiki2txt (https://github.com/laohur/wiki2txt): Wikipedia parser
UnicodeTokenizer (https://github.com/laohur/UnicodeTokenizer): tokenizer
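Vocabulary
Within level 5: 7k words, covering 87.4%. Within level 6: 33k words, covering 95.2%.
Some tests claim that only a few thousand words are needed, but they mean a curated set of a few thousand stems. The vocabulary in this project is an absolute word list selected by frequency.
Frequency table: word frequencies counted up to level 6, about 32,000 words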
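As a rough sketch of how figures such as "7k words, covering 87.4%" could be derived, the snippet below counts how many distinct words fall under a level cutoff and what share of all tokens they account for. The function name `coverage_within`, the toy counts, and the strict `<` comparison at the cutoff are assumptions made for illustration.

```python
import math
from collections import Counter


def coverage_within(counts, max_level):
    """Return (vocabulary size, token coverage) for words below max_level."""
    total = sum(counts.values())
    # Keep the counts of words whose level -log10(count / total) is under the cutoff.
    kept = [c for c in counts.values() if -math.log10(c / total) < max_level]
    return len(kept), sum(kept) / total


# Toy frequency table (made-up numbers, not taken from the real corpus).
counts = Counter({"the": 900, "quantum": 90, "quanta": 9, "zyzzyva": 1})

size_a, cover_a = coverage_within(counts, 2.0)
size_b, cover_b = coverage_within(counts, 2.5)
print(size_a, f"{cover_a:.1%}")
print(size_b, f"{cover_b:.1%}")
```

Raising the cutoff admits rarer words, so the vocabulary grows much faster than the coverage does, which is the pattern the figures above show between level 5 and level 6.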
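Output
Term selection rule: the title is a plain English word, falls within level 6, and has a suitable definition.
This yields 2197 entries, sorted by descending word frequency. Every 24 articles are merged into one chapter, for about one hundred chapters in total. Each article explains one term and consists of the term title, the explanatory text, and word glosses.
The body text contains 7103 level-5 words, 17540 level-6 words, and 15474 rare words beyond level 6, covering 92% of word frequency. The full text runs to 630,000 words.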
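The selection rule and the 24-per-chapter grouping might look roughly like this. It is a sketch under stated assumptions: `select_terms` and `chapters` are hypothetical names, and `levels` and `definitions` are plain dicts keyed by lowercased word rather than the project's actual data files.

```python
import math


def select_terms(titles, levels, definitions, max_level=6.0):
    """Keep single-word ASCII-letter titles within max_level that have a definition."""
    chosen = [
        title
        for title in titles
        if title.isascii()
        and title.isalpha()
        and levels.get(title.lower(), math.inf) < max_level
        and definitions.get(title.lower())
    ]
    # Ascending level is the same order as descending frequency.
    return sorted(chosen, key=lambda title: levels[title.lower()])


def chapters(terms, size=24):
    """Split the ordered terms into consecutive chapters of `size` articles."""
    return [terms[i:i + size] for i in range(0, len(terms), size)]


# Toy inputs: "zyzzyva" is given a made-up level, the other levels are
# copied from the layout sample in this document.
titles = ["Minimum", "C++", "Zyzzyva", "Quanta", "Quantum"]
levels = {"quantum": 3.93, "minimum": 4.51, "quanta": 6.13, "zyzzyva": 7.2}
definitions = {"quantum": "n. ...", "minimum": "n. ...", "quanta": "n. ..."}

selected = select_terms(titles, levels, definitions)
grouped = chapters(list(range(2197)))
print(selected, len(grouped), len(grouped[-1]))
```

With 2197 entries and 24 per chapter, this grouping gives 92 chapters, the last one holding 13 articles, which matches the index running from Section 000 to Section 091.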
Layout sample
Only the first sentence is glossed
In physics, a quantum (plural quanta) is the minimum amount of any physical entity (physical property) involved in an interaction. The fundamental notion that a physical property can be "quantized" is referred to as "the hypothesis of quantization". This means that the magnitude of the physical property can take on only discrete values consisting of integer multiples of one quantum. For example, a photon is a single quantum of light (or of any other form of electromagnetic radiation). Similarly, the energy of an electron bound within an atom is quantized and can exist only in certain discrete values. (Atoms and matter in general are stable because electrons can exist only at discrete energy levels within an atom.) Quantization is one of the foundations of the much broader physics of quantum mechanics. Quantization of energy and its influence on how energy and matter interact (quantum electrodynamics) is part of the fundamental framework for understanding and describing nature.
word | phonetic | definition | translation | root | lemma | degree |
quantum | 'kwɒntәm | n. a discrete amount of something that is analogous to the quantities in quantum theory n. (physics) the smallest discrete quantity of some physical property that a system can possess (according to quantum theory) | n. 量, 量子 [计] 量子 | | | 3.93 |
quanta | 'kwɒntә | n a discrete amount of something that is analogous to the quantities in quantum theory n (physics) the smallest discrete quantity of some physical property that a system can possess (according to quantum theory) | pl. 量, 量子 [医] 量子, 量 | | quantum | 6.13 |
minimum | 'minimәm | n. the smallest possible quantity n. the point on a curve where the tangent changes from negative on the left to positive on the right | a. 最小的, 最低的 n. 最小值 [计] 最小值 | min | | 4.51 |
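Usage
Browse the terms in each article to gauge your vocabulary, then start from the first ones you do not know.
For each article, reading only the first sentence is enough.
Blue text carries links for further reading.
Index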
Learning Language as AI
Section 000
Section 001
Section 002
Section 003
Section 004
Section 005
Section 006
Section 007
Section 008
Section 009
Section 010
Section 011
Section 012
Section 013
Section 014
Section 015
Section 016
Section 017
Section 018
Section 019
Section 020
Section 021
Section 022
Section 023
Section 024
Section 025
Section 026
Section 027
Section 028
Section 029
Section 030
Section 031
Section 032
Section 033
Section 034
Section 035
Section 036
Section 037
Section 038
Section 039
Section 040
Section 041
Section 042
Section 043
Section 044
Section 045
Section 046
Section 047
Section 048
Section 049
Section 050
Section 051
Section 052
Section 053
Section 054
Section 055
Section 056
Section 057
Section 058
Section 059
Section 060
Section 061
Section 062
Section 063
Section 064
Section 065
Section 066
Section 067
Section 068
Section 069
Section 070
Section 071
Section 072
Section 073
Section 074
Section 075
Section 076
Section 077
Section 078
Section 079
Section 080
Section 081
Section 082
Section 083
Section 084
Section 085
Section 086
Section 087
Section 088
Section 089
Section 090
Section 091