A Frequency Dictionary of Printed Arabic Text
A Frequency Dictionary of PATD

A frequency dictionary of printed Arabic text is an essential resource for natural language processing. This dictionary is built on the PATD database, a corpus of 1,251 XML documents collected from ten Arabic newspapers and magazines published in different countries. Drawing on 2,344 articles comprising 1,102,078 tokens, 19,926 sentences, and 1,000,000 words, it provides detailed information for each word, including its English equivalent, usage statistics, and usage distribution, and it identifies the most widely used words. Users can look up the top words through an alphabetical index arranged by Arabic root.




Word Frequency List

The list contains 51,847 distinct words, ordered from highest frequency to lowest. Each entry gives the rank frequency, the English equivalent, the relative document frequency (RDOCF), the average reduced frequency (ARF), and the average logarithmic distance (ALDF).
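As a rough illustration of how such per-word statistics can be derived, the following Python sketch computes frequency, document frequency, relative document frequency, and average reduced frequency from tokenized documents. It rests on assumptions that the description above does not confirm: that RDOCF is the percentage of documents containing the word (the sample rows are consistent with this, since 336 / 1,251 ≈ 26.86%), and that ARF follows the commonly cited definition by Savický and Hlaváčová. The function names and the input format (a list of token lists) are hypothetical, and the average logarithmic distance is left out.

```python
from collections import Counter


def average_reduced_frequency(positions, corpus_size):
    """ARF = (1/v) * sum(min(d_i, v)), where v = corpus_size / frequency
    and d_i are the cyclic distances between consecutive occurrences.
    Assumed definition; the dictionary's exact formula is not stated."""
    f = len(positions)
    if f == 0:
        return 0.0
    v = corpus_size / f
    ordered = sorted(positions)
    distances = [b - a for a, b in zip(ordered, ordered[1:])]
    # Close the cycle: distance from the last occurrence back to the first.
    distances.append(ordered[0] + corpus_size - ordered[-1])
    return sum(min(d, v) for d in distances) / v


def word_statistics(documents):
    """documents: list of token lists (hypothetical input format).
    Returns one row per word, sorted from highest frequency to lowest."""
    freq = Counter()   # total occurrences per word
    docf = Counter()   # number of documents containing the word
    positions = {}     # corpus-wide token offsets per word
    offset = 0
    for tokens in documents:
        freq.update(tokens)
        docf.update(set(tokens))
        for i, token in enumerate(tokens):
            positions.setdefault(token, []).append(offset + i)
        offset += len(tokens)

    rows = []
    for rank, (word, count) in enumerate(freq.most_common(), start=1):
        rows.append({
            "rank": rank,
            "word": word,
            "frequency": count,
            "docf": docf[word],
            # Assumed: RDOCF is a percentage of all documents.
            "rdocf": 100.0 * docf[word] / len(documents),
            "arf": average_reduced_frequency(positions[word], offset),
        })
    return rows
```

A word spread evenly through the corpus has an ARF close to its raw frequency, while a word concentrated in a few passages has a much lower one, which is why ARF is reported alongside the plain count.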


N-gram Frequency List

The N-gram and keyword analysis lists the most frequently used contiguous sequences at each gram level, from 2-grams through 6-grams. For each sequence it provides the rank frequency, relative document frequency, average reduced frequency, and average logarithmic distance.
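Counting contiguous sequences of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a list of already tokenized documents, n-grams do not cross document boundaries, and the names `ngram_counts` and `top_ngrams` are hypothetical rather than part of the dictionary's tooling.

```python
from collections import Counter


def ngram_counts(documents, n):
    """Count contiguous n-token sequences within each document."""
    counts = Counter()
    for tokens in documents:
        # zip over n shifted views of the token list yields every n-gram.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def top_ngrams(documents, sizes=range(2, 7), k=10):
    """Return the k most frequent n-grams for each size from 2 to 6."""
    return {n: ngram_counts(documents, n).most_common(k) for n in sizes}
```

Documents shorter than `n` tokens simply contribute no n-grams, so the same function can be applied at every gram level without special cases.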


Thematic Vocabulary List

Given the nature of the corpus, the vocabulary is grouped into 22 different topics, and the most frequently used words in each topic are listed along with their rank frequency. The topics are: Animals, Clothing, Emotions, Colors, Materials, Weather, Food, Technology, Movement, Health, Environment, Nature, Body, Religion, Family, Communication, Transportation, Professions, Sports, War, Security, Economy, Business, Time, Politics, and Laws.
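The description does not say how words were assigned to topics. Assuming a hand-built mapping from word to topic exists, a per-topic ranking could be produced along the following lines. The `topic_lexicon` mapping, the topic assignments in the usage example, and the function name are all hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def rank_by_topic(word_freq, topic_lexicon, k=5):
    """word_freq: word -> corpus frequency.
    topic_lexicon: word -> topic name (hypothetical, hand-built mapping).
    Returns topic -> list of (rank, word, frequency), highest frequency first."""
    by_topic = defaultdict(list)
    for word, topic in topic_lexicon.items():
        if word in word_freq:
            by_topic[topic].append((word, word_freq[word]))

    ranked = {}
    for topic, items in by_topic.items():
        # Sort by descending frequency, breaking ties alphabetically.
        items.sort(key=lambda pair: (-pair[1], pair[0]))
        ranked[topic] = [
            (rank, word, count)
            for rank, (word, count) in enumerate(items[:k], start=1)
        ]
    return ranked


# Usage with frequencies taken from the sample table; the topic
# assignments are illustrative assumptions only.
word_freq = {"يوم": 620, "سنوات": 571, "الامن": 582}
topic_lexicon = {"يوم": "Time", "سنوات": "Time", "الامن": "Security"}
ranked = rank_by_topic(word_freq, topic_lexicon)
```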




Samples

Word      English equivalent   Frequency   DOCF   RDOCF      ARF         ALDF
يوم       Day                  620         336    26.85851   321.9292    302.61615
اذا       If                   618         309    24.70024   293.48804   246.70552
محمد      Mohammed             613         294    23.5012    268.10938   223.05666
رغم       Despite              591         318    25.41966   308.73767   286.13867
جميع      All                  590         354    28.29736   315.19504   291.32956
الي       To                   586         360    28.77698   337.17566   336.33005
الامن     Security             582         362    20.94325   231.18333   141.46347
ليس       Not                  577         306    24.46043   274.45099   240.66968
سنوات     Years                571         302    24.14069   297.78357   275.75275
حين       When                 565         302    24.14069   283.90131   264.00684


This frequency dictionary is a valuable resource on modern Arabic vocabulary for specialists, students, and learners. It is freely available to interested researchers, in the hope that it will assist the Arabic language processing research community.