A Frequency Dictionary of PATD
A frequency dictionary of printed Arabic text is an essential resource for natural language processing.
This dictionary is built on the PATD database, an XML corpus of 1,251 documents collected from ten Arabic newspapers and magazines published in different countries.
Drawing on 2,344 articles that yield 1,102,078 tokens, 19,926 sentences, and 1,000,000 words, it provides detailed information for each word, including English equivalents, usage statistics, usage distribution, and the most widely used words. Users can access the top words through an alphabetical index arranged by Arabic roots.
Word Frequency List
The list contains 51,847 words from the corpus, ordered from highest to lowest frequency. Each entry gives the word's frequency rank, English equivalent, relative document frequency, average reduced frequency, and average logarithmic distance.
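As a rough illustration of how such per-word statistics can be derived, the sketch below computes frequency, document frequency, relative document frequency, and average reduced frequency (ARF) from already tokenized documents. It is not the dictionary's own pipeline. The function name `word_statistics` and the input layout (a list of token lists) are assumptions made for this example. Relative document frequency is taken to be a percentage of all documents, and ARF follows the commonly cited definition based on cyclic distances between consecutive occurrences. Average logarithmic distance is left out.

```python
from collections import Counter, defaultdict


def word_statistics(documents):
    """Compute per-word statistics from a list of token lists.

    Hypothetical helper for illustration only.
    """
    n_docs = len(documents)
    tokens = [tok for doc in documents for tok in doc]
    n_tokens = len(tokens)

    # Raw frequency and number of documents containing each word.
    freq = Counter(tokens)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))

    # Positions of every word in the concatenated corpus.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    stats = {}
    for word, f in freq.items():
        pos = positions[word]
        # Cyclic distances between consecutive occurrences; they sum to n_tokens.
        dists = [pos[0] + n_tokens - pos[-1]]
        dists += [b - a for a, b in zip(pos, pos[1:])]
        v = n_tokens / f  # average distance if occurrences were evenly spread
        arf = sum(min(d, v) for d in dists) / v
        stats[word] = {
            "frequency": f,
            "docf": doc_freq[word],
            "rdocf": 100 * doc_freq[word] / n_docs,  # assumed: percent of documents
            "arf": arf,
        }
    return stats


def ranked_words(stats):
    """Return words ordered from highest to lowest frequency."""
    return sorted(stats, key=lambda w: stats[w]["frequency"], reverse=True)
```

A word that is spread evenly across the corpus has an ARF close to its raw frequency, while a word concentrated in a few passages has a much lower ARF.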
N-gram Frequency List
The N-gram and keyword analysis lists the most frequently used contiguous word sequences at each level, from 2-grams to 6-grams. For each sequence it provides the frequency rank, relative document frequency, average reduced frequency, and average logarithmic distance.
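A minimal sketch of how contiguous N-grams might be counted, assuming each document is already a list of tokens and that sequences do not cross document boundaries. The function names `ngram_counts` and `top_ngrams` are hypothetical and do not describe the tooling used to build the dictionary.

```python
from collections import Counter


def ngram_counts(documents, n):
    """Count contiguous n-token sequences within each document."""
    counts = Counter()
    for doc in documents:
        # Slide a window of length n over the document.
        for i in range(len(doc) - n + 1):
            counts[tuple(doc[i:i + n])] += 1
    return counts


def top_ngrams(documents, sizes=range(2, 7), k=10):
    """Return the k most frequent n-grams for each size (2 to 6 by default)."""
    return {n: ngram_counts(documents, n).most_common(k) for n in sizes}
```

Calling `top_ngrams(documents)` returns a dictionary keyed by N-gram size, with each value a list of `(ngram, count)` pairs ordered from most to least frequent.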
Thematic Vocabulary List
Reflecting the nature of the corpus, the vocabulary is grouped into 22 topics, and the most frequently used words in each topic are listed along with their frequency rank:
Animals, Clothing, Emotions, Colors, Materials, Weather, Food, Technology, Movement, Health, Environment, Nature, Body, Religion, Family, Communication, Transportation, Professions, Sports, War, Security, Economy, Business, Time, Politics, and Laws.
Samples
Word | English equivalent | Frequency | DOCF | RDOCF | ARF | ALDF |
يوم | Day | 620 | 336 | 26.85851 | 321.9292 | 302.61615 |
اذا | If | 618 | 309 | 24.70024 | 293.48804 | 246.70552 |
محمد | Mohammed | 613 | 294 | 23.5012 | 268.10938 | 223.05666 |
رغم | despite | 591 | 318 | 25.41966 | 308.73767 | 286.13867 |
جميع | all | 590 | 354 | 28.29736 | 315.19504 | 291.32956 |
الي | to | 586 | 360 | 28.77698 | 337.17566 | 336.33005 |
الامن | Security | 582 | 362 | 20.94325 | 231.18333 | 141.46347 |
ليس | Not | 577 | 306 | 24.46043 | 274.45099 | 240.66968 |
سنوات | Years | 571 | 302 | 24.14069 | 297.78357 | 275.75275 |
حين | when | 565 | 302 | 24.14069 | 283.90131 | 264.00684 |
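The sample rows can be read programmatically. The sketch below parses one pipe-delimited row into a record and recomputes RDOCF from DOCF. The `Entry` class, the function names, and the divisor of 1,251 (the document count given in the introduction) are assumptions for illustration. The relation RDOCF = 100 × DOCF / 1,251 is an inference that matches the rows used here, not a documented definition.

```python
from dataclasses import dataclass

TOTAL_DOCUMENTS = 1251  # assumed: document count stated in the introduction


@dataclass
class Entry:
    word: str
    english: str
    frequency: int
    docf: int
    rdocf: float
    arf: float
    aldf: float


def parse_row(line):
    """Parse one 'word | english | freq | docf | rdocf | arf | aldf |' row."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    word, english, freq, docf, rdocf, arf, aldf = cells
    return Entry(word, english, int(freq), int(docf),
                 float(rdocf), float(arf), float(aldf))


def relative_doc_frequency(docf, total_documents=TOTAL_DOCUMENTS):
    """Document frequency as a percentage of all documents (assumed formula)."""
    return 100 * docf / total_documents


entry = parse_row("يوم | Day | 620 | 336 | 26.85851 | 321.9292 | 302.61615 |")
recomputed = round(relative_doc_frequency(entry.docf), 5)
```

For this row, `recomputed` agrees with the listed RDOCF value to five decimal places.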
This frequency dictionary is a valuable resource on modern Arabic vocabulary for specialists, students, and learners.
The frequency dictionary is freely available to interested researchers, with the hope that it will assist the Arabic language processing research community.