Title: | Arabic Stemmer for Text Analysis |
---|---|
Description: | Allows users to stem Arabic texts for text analysis. |
Authors: | Rich Nielsen |
Maintainer: | Rich Nielsen <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.3 |
Built: | 2025-02-18 05:41:02 UTC |
Source: | https://github.com/cran/arabicStemR |
This package is a stemmer for texts in Arabic (Modern Standard). The stemmer is loosely based on the light 10 stemmer, but with a number of modifications.
Use the stemArabic function.
## generate some text in Arabic
x <- "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062D\u0645\u0646 \u0627\u0644\u0631\u062D\u064A\u0645"
## stem and transliterate
stemArabic(x)
## stem while not stemming certain words
stemArabic(x, dontStemTheseWords = c("alr7mn"))
## stem and return the stemlist
out <- stemArabic(x, returnStemList = TRUE)
out$text
out$stemlist
Removes any characters from a string that are in neither the Latin Unicode range nor the Arabic alphabet.
cleanChars(texts)
texts | A string from which characters that are not Latin or Arabic should be removed. |
cleanChars returns a string with only Latin and Arabic characters.
Rich Nielsen
## Create string with Arabic, Latin, and Hebrew characters
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 Hello \u05d0'
## Remove characters from string that are not Arabic or Latin
cleanChars(x)
Cleans Latin characters from a string
cleanLatinChars(texts)
texts | A string from which Latin characters should be removed. |
cleanLatinChars returns a string with Latin characters removed.
Rich Nielsen
## Create string with Arabic and Latin characters
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 Hello'
## Remove Latin characters from string
cleanLatinChars(x)
Removes prefixes and suffixes, and can return a list matching the words to stemmed words. Does not stem different forms of Allah.
doStemming(texts, dontstem = c('\u0627\u0644\u0644\u0647','\u0644\u0644\u0647'))
texts | The original texts. |
dontstem | Words that should not be stemmed. By default, different forms of Allah are not stemmed. |
doStemming returns a named list with the following elements:
text | The stemmed text. |
stemmedWords | A list matching the words and the stemmed words. |
Rich Nielsen
## Create string with Arabic characters
x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'
## Remove prefixes and suffixes
y <- doStemming(x)
y$text
y$stemmedWords
Standardizes different hamzas on alif seats in a string.
fixAlifs(texts)
texts | A string in which different alifs should be standardized. |
fixAlifs returns a string with standardized alifs.
Rich Nielsen
## Create string with Arabic characters
x <- '\u0622 \u0623 \u0675'
## Standardize Alifs
fixAlifs(x)
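The kind of normalization fixAlifs performs can be sketched in base R. This is only an illustration, not the package's implementation: the helper name normalize_alifs and the exact set of alif variants it maps to the bare alif (U+0627) are assumptions.

```r
# Hypothetical helper (not part of arabicStemR): map common alif variants
# to the bare alif U+0627. The set of variants handled here is an assumption.
normalize_alifs <- function(texts) {
  # U+0622 alif with madda, U+0623 alif with hamza above,
  # U+0625 alif with hamza below, U+0675 high hamza alif
  gsub("[\u0622\u0623\u0625\u0675]", "\u0627", texts, perl = TRUE)
}

x <- "\u0622 \u0623 \u0675"
normalize_alifs(x)
```

Collapsing the variants to one code point means that spellings differing only in the hamza seat compare as equal in later steps.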
Removes Arabic numerals from a string.
removeArabicNumbers(texts)
texts | A string from which Arabic numerals should be removed. |
removeArabicNumbers returns a string with Arabic numerals removed.
Rich Nielsen
## Create string with Arabic characters and numbers
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 \u0661\u0662\u0663'
## Remove Arabic numbers
removeArabicNumbers(x)
Removes diacritics from Arabic unicode text.
removeDiacritics(texts)
texts | A string from which Arabic diacritics should be removed. |
removeDiacritics returns a string with Arabic diacritics removed.
Rich Nielsen
## Create string with Arabic characters and diacritics
x <- '\u0627\u0647\u0644\u0627\u064b \u0648\u0633\u0647\u0644\u0627\u064b'
## Remove diacritics
removeDiacritics(x)
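As a rough illustration of diacritic removal, the base R sketch below deletes the marks in the range U+064B to U+0652 (fathatan through sukun). The helper name strip_diacritics and the choice of range are assumptions; this document does not list which marks removeDiacritics actually removes.

```r
# Hypothetical helper (not part of arabicStemR): delete the short-vowel and
# related marks U+064B..U+0652. The range is an assumption.
strip_diacritics <- function(texts) {
  gsub("[\u064b-\u0652]", "", texts, perl = TRUE)
}

x <- "\u0627\u0647\u0644\u0627\u064b \u0648\u0633\u0647\u0644\u0627\u064b"
strip_diacritics(x)
```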
Removes English numerals from a string.
removeEnglishNumbers(texts)
texts | A string from which English numerals should be removed. |
removeEnglishNumbers returns a string with English numerals removed.
Rich Nielsen
## Create string with Arabic characters and English numbers
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 123'
## Remove English numbers
removeEnglishNumbers(x)
Removes Farsi numerals from a string.
removeFarsiNumbers(texts)
texts | A string from which Farsi numerals should be removed. |
removeFarsiNumbers returns a string with Farsi numerals removed.
Rich Nielsen
## Create string with Arabic characters and Farsi numbers
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 \u06f1\u06f2\u06f3\u06f4\u06f5'
## Remove Farsi numbers
removeFarsiNumbers(x)
Removes new line characters from a string.
removeNewlineChars(texts)
texts | A string from which new line characters should be removed. |
removeNewlineChars returns a string with new line characters removed.
Rich Nielsen
## Create string with Arabic characters
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 \u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627'
## Remove newline characters (gets rid of \n\r\t\f\v)
removeNewlineChars(x)
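The example string above contains no control characters, so a small base R sketch with an embedded newline and tab may make the effect easier to see. The helper name collapse_newlines is hypothetical, and replacing each control character with a single space (rather than deleting it) is an assumption about the behavior, not something this document states.

```r
# Hypothetical helper (not part of arabicStemR): replace each newline,
# carriage return, tab, form feed, or vertical tab with a single space.
collapse_newlines <- function(texts) {
  gsub("[\n\r\t\f\v]", " ", texts, perl = TRUE)
}

# Arabic text with an embedded newline and tab
x <- "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627\n\u0627\u0647\u0644\u0627\t\u0648\u0633\u0647\u0644\u0627"
collapse_newlines(x)
```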
Removes English, Arabic, and Farsi numerals from a string.
removeNumbers(texts)
texts | A string from which English, Arabic, and Farsi numerals should be removed. |
removeNumbers returns a string with English, Arabic, and Farsi numerals removed.
Rich Nielsen
## Create string with Arabic characters and numbers
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 123 \u0661\u0662\u0663'
## Remove numbers
removeNumbers(x)
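The three numeral families handled by removeEnglishNumbers, removeArabicNumbers, and removeFarsiNumbers correspond to three Unicode digit ranges. The sketch below removes all three in one pass using base R; the helper name strip_digits is hypothetical and the code is not taken from the package.

```r
# Hypothetical helper (not part of arabicStemR): delete digits from three ranges.
strip_digits <- function(texts) {
  # 0-9              ASCII ("English") digits
  # U+0660..U+0669   Arabic-Indic digits
  # U+06F0..U+06F9   Extended Arabic-Indic digits, as used in Farsi
  gsub("[0-9\u0660-\u0669\u06f0-\u06f9]", "", texts, perl = TRUE)
}

x <- "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 123 \u0661\u0662\u0663"
strip_digits(x)
```

In this sketch the spaces that surrounded the digits are left in place, so trailing whitespace may need trimming afterwards.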
Removes some Arabic prefixes from a unicode string. The prefixes are: "waw", "alif-lam", "waw-alif-lam", "ba-alif-lam", "kaf-alif-lam", "fa-alif-lam", and "lam-lam." Prefixes are removed from a word (as defined by spaces) only if the remaining stem would not be too short.
removePrefixes(texts, x1 = 4, x2 = 4, x3 = 5, x4 = 5, x5 = 5, x6 = 5, x7 = 4, dontstem = c('\u0627\u0644\u0644\u0647','\u0644\u0644\u0647'))
texts | An Arabic-language string in unicode. |
x1 | The number of letters that must be in a word for the function to remove the prefix "waw". |
x2 | The number of letters that must be in a word for the function to remove the prefix "alif-lam". |
x3 | The number of letters that must be in a word for the function to remove the prefix "waw-alif-lam". |
x4 | The number of letters that must be in a word for the function to remove the prefix "ba-alif-lam". |
x5 | The number of letters that must be in a word for the function to remove the prefix "kaf-alif-lam". |
x6 | The number of letters that must be in a word for the function to remove the prefix "fa-alif-lam". |
x7 | The number of letters that must be in a word for the function to remove the prefix "lam-lam". |
dontstem | Words that should not be stemmed (entered in unicode). |
Returns a string with Arabic prefixes removed.
Rich Nielsen
## Create string with Arabic characters
x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'
## Remove prefixes
removePrefixes(x)
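The length-threshold rule described for removePrefixes can be sketched in base R as follows. This is a hedged illustration, not the package's code: the helper name strip_prefixes is hypothetical, trying longer prefixes first and removing at most one prefix per word are assumptions, and "the number of letters that must be in a word" is read here as word length greater than or equal to the threshold.

```r
# Hypothetical sketch (not the arabicStemR implementation): strip at most one
# prefix from each space-delimited word, and only when the word has at least
# the minimum number of letters given for that prefix.
strip_prefixes <- function(texts,
                           dontstem = c("\u0627\u0644\u0644\u0647", "\u0644\u0644\u0647")) {
  # Longer prefixes are tried first; this ordering is an assumption.
  prefixes <- c("\u0648\u0627\u0644",  # waw-alif-lam
                "\u0628\u0627\u0644",  # ba-alif-lam
                "\u0643\u0627\u0644",  # kaf-alif-lam
                "\u0641\u0627\u0644",  # fa-alif-lam
                "\u0627\u0644",        # alif-lam
                "\u0644\u0644",        # lam-lam
                "\u0648")              # waw
  # Defaults from the usage line, reordered to match: x3, x4, x5, x6, x2, x7, x1
  min_len <- c(5, 5, 5, 5, 4, 4, 4)

  strip_one <- function(w) {
    if (w %in% dontstem) return(w)  # protected words are left untouched
    for (i in seq_along(prefixes)) {
      if (startsWith(w, prefixes[i]) && nchar(w) >= min_len[i]) {
        return(substring(w, nchar(prefixes[i]) + 1))
      }
    }
    w
  }

  words <- strsplit(texts, " ", fixed = TRUE)[[1]]
  paste(vapply(words, strip_one, character(1), USE.NAMES = FALSE), collapse = " ")
}

x <- "\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627"
strip_prefixes(x)
```

The thresholds keep short words intact: a three-letter word that merely begins with waw is not treated as carrying the conjunction prefix.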
Removes punctuation from a string, including some specialized Arabic characters.
removePunctuation(texts)
texts | A string from which punctuation should be removed. |
Returns a string with punctuation removed.
Rich Nielsen
## Create string with Arabic characters and punctuation
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627!!!?'
## Remove punctuation
removePunctuation(x)
Defines a list of Arabic-language stopwords and removes them from a string.
removeStopWords(texts, defaultStopwordList=TRUE, customStopwordList=NULL)
texts | A string from which Arabic stopwords should be removed. |
defaultStopwordList | If TRUE, use the default stopword list of words to be removed. If FALSE, do not use the default stopword list. Default is TRUE. |
customStopwordList | Optional user-specified stopword list of words to be removed, supplied as a vector of strings in either Arabic UTF-8 or Latin characters following the stemmer's transliteration scheme (words without Arabic UTF-8 characters are processed with reverse.transliterate()). Default is NULL. |
Returns a named list. As the example shows, the text element holds the string with Arabic stopwords removed, and the arabicStopwordList element holds the full list of stopwords.
Rich Nielsen
## Create string with Arabic characters
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 \u064a\u0627 \u0635\u062f\u064a\u0642\u064a'
## Remove stop words
removeStopWords(x)$text
## Not run
## To see the full list of stop words
removeStopWords(x)$arabicStopwordList
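The core of stopword removal, dropping every space-delimited word that appears in a list, can be sketched in base R. The helper name drop_stopwords and the one-word stopword list are illustrative assumptions; the package's default list and its handling of transliterated entries are not reproduced here.

```r
# Hypothetical helper (not part of arabicStemR): drop every space-delimited
# word that appears in a caller-supplied stopword vector.
drop_stopwords <- function(texts, stopwords) {
  words <- strsplit(texts, " ", fixed = TRUE)[[1]]
  paste(words[!(words %in% stopwords)], collapse = " ")
}

x <- "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627 \u064a\u0627 \u0635\u062f\u064a\u0642\u064a"
# "\u064a\u0627" (the vocative particle "ya") serves as a one-word custom list
drop_stopwords(x, stopwords = c("\u064a\u0627"))
```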
Removes some Arabic suffixes from a unicode string. The suffixes (in order of removal) are: "ha-alif", "alif-nun", "alif-ta", "waw-nun", "yah-nun", "yah-heh", "yah-ta marbutta", "heh", "ta marbutta", and "yah." Suffixes are removed from a word (as defined by spaces) only if the remaining stem would not be too short. Only one suffix is removed from each word.
removeSuffixes(texts, x1 = 4, x2 = 4, x3 = 4, x4 = 4, x5 = 4, x6 = 4, x7 = 4, x8 = 3, x9 = 3, x10 = 3, dontstem = c('\u0627\u0644\u0644\u0647','\u0644\u0644\u0647'))
texts | An Arabic-language string in unicode. |
x1 | The number of letters that must be in a word for the function to remove the suffix "ha-alif". |
x2 | The number of letters that must be in a word for the function to remove the suffix "alif-nun". |
x3 | The number of letters that must be in a word for the function to remove the suffix "alif-ta". |
x4 | The number of letters that must be in a word for the function to remove the suffix "waw-nun". |
x5 | The number of letters that must be in a word for the function to remove the suffix "yah-nun". |
x6 | The number of letters that must be in a word for the function to remove the suffix "yah-heh". |
x7 | The number of letters that must be in a word for the function to remove the suffix "yah-ta marbutta". |
x8 | The number of letters that must be in a word for the function to remove the suffix "heh". |
x9 | The number of letters that must be in a word for the function to remove the suffix "ta marbutta". |
x10 | The number of letters that must be in a word for the function to remove the suffix "yah". |
dontstem | Words that should not be stemmed (entered in unicode). |
Returns a string with Arabic suffixes removed.
Rich Nielsen
## Create string with Arabic characters
x <- '\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627'
## Remove suffixes
removeSuffixes(x)
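The suffix rule described above (try the suffixes in the stated order, remove at most one per word, and skip words that are too short) can be sketched in base R. The helper name strip_suffixes is hypothetical, the code is not the package's implementation, and reading each threshold as word length greater than or equal to the value is an assumption.

```r
# Hypothetical sketch (not the arabicStemR implementation): remove at most one
# suffix per space-delimited word, testing suffixes in the documented order.
strip_suffixes <- function(texts,
                           dontstem = c("\u0627\u0644\u0644\u0647", "\u0644\u0644\u0647")) {
  suffixes <- c("\u0647\u0627",  # ha-alif
                "\u0627\u0646",  # alif-nun
                "\u0627\u062a",  # alif-ta
                "\u0648\u0646",  # waw-nun
                "\u064a\u0646",  # yah-nun
                "\u064a\u0647",  # yah-heh
                "\u064a\u0629",  # yah-ta marbutta
                "\u0647",        # heh
                "\u0629",        # ta marbutta
                "\u064a")        # yah
  # Defaults x1..x10 from the usage line
  min_len <- c(4, 4, 4, 4, 4, 4, 4, 3, 3, 3)

  strip_one <- function(w) {
    if (w %in% dontstem) return(w)  # protected words are left untouched
    for (i in seq_along(suffixes)) {
      if (endsWith(w, suffixes[i]) && nchar(w) >= min_len[i]) {
        # Only the first matching suffix is removed
        return(substring(w, 1, nchar(w) - nchar(suffixes[i])))
      }
    }
    w
  }

  words <- strsplit(texts, " ", fixed = TRUE)[[1]]
  paste(vapply(words, strip_one, character(1), USE.NAMES = FALSE), collapse = " ")
}

x <- "\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u062c\u0645\u064a\u0644\u0629 \u062c\u062f\u0627"
strip_suffixes(x)
```

Because two-letter suffixes are tested before one-letter ones, a word ending in yah-ta marbutta loses both letters rather than only the final ta marbutta.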
Transliterates Latin characters into Arabic Unicode characters using a transliteration system developed by Rich Nielsen.
reverse.transliterate(texts)
texts | A string in Latin characters to be transliterated into Arabic characters. |
Returns a string in Arabic characters.
Rich Nielsen
## Create Latin string following the arabicStemR package transliteration scheme.
x <- 'al3rby'
## Convert Latin characters into Arabic unicode characters
reverse.transliterate(x)
Allows users to stem Arabic texts for text analysis. Now deprecated. Please use stemArabic.
stem(dat, cleanChars = TRUE, cleanLatinChars = TRUE, transliteration = TRUE, returnStemList = FALSE, defaultStopwordList=TRUE, customStopwordList=NULL, dontStemTheseWords = c("allh", "llh"))
dat | The original data, as a vector of length one containing the text. |
cleanChars | If TRUE, removes all Unicode characters except Latin characters and the Arabic alphabet. |
cleanLatinChars | If TRUE, removes Latin characters. |
transliteration | If TRUE, transliterates the text. |
returnStemList | If TRUE, returns the stemlist along with the stemmed text. Default is FALSE. |
defaultStopwordList | If TRUE, use the default stopword list of words to be removed. If FALSE, do not use the default stopword list. Default is TRUE. |
customStopwordList | Optional user-specified stopword list of words to be removed, supplied as a vector of strings in either Arabic UTF-8 or Latin characters following the stemmer's transliteration scheme (words without Arabic UTF-8 characters are processed with reverse.transliterate()). Default is NULL. |
dontStemTheseWords | Optional vector of strings that should not be stemmed. These words can be supplied as transliterated Arabic (according to the transliteration scheme of transliterate() and reverse.transliterate()) or in unicode Arabic. If a term matches an element of this argument at any intermediate point in stemming, that term will not be stemmed further. The default is c("allh","llh") because in most applications, stemming these common words for "God" creates some confusion by resulting in the string "lh". |
stem prepares texts in Arabic for text analysis by stemming.
stem returns a named list with the following elements:
text | The stemmed text. |
stemlist | A list of the stemmed words. |
Rich Nielsen
## generate some text in Arabic
x <- "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062D\u0645\u0646 \u0627\u0644\u0631\u062D\u064A\u0645"
## stem and transliterate
## NOTE: the "stem()" function only accepts a vector of length 1.
## The function is deprecated in favor of stemArabic(), which accepts vectors with multiple elements.
stem(x)
## stem while not stemming certain words
stem(x, dontStemTheseWords = c("alr7mn"))
## stem and return the stemlist
out <- stem(x, returnStemList = TRUE)
out$text
out$stemlist
Allows users to stem Arabic texts for text analysis.
stemArabic(dat, cleanChars = TRUE, cleanLatinChars = TRUE, transliteration = TRUE, returnStemList = FALSE, defaultStopwordList=TRUE, customStopwordList=NULL, dontStemTheseWords = c("allh", "llh"))
dat | The original data, as a vector of texts. |
cleanChars | If TRUE, removes all Unicode characters except Latin characters and the Arabic alphabet. |
cleanLatinChars | If TRUE, removes Latin characters. |
transliteration | If TRUE, transliterates the text. |
returnStemList | If TRUE, returns the stemlist along with the stemmed text. Default is FALSE. |
defaultStopwordList | If TRUE, use the default stopword list of words to be removed. If FALSE, do not use the default stopword list. Default is TRUE. |
customStopwordList | Optional user-specified stopword list of words to be removed, supplied as a vector of strings in either Arabic UTF-8 or Latin characters following the stemmer's transliteration scheme (words without Arabic UTF-8 characters are processed with reverse.transliterate()). Default is NULL. |
dontStemTheseWords | Optional vector of strings that should not be stemmed. These words can be supplied as transliterated Arabic (according to the transliteration scheme of transliterate() and reverse.transliterate()) or in unicode Arabic. If a term matches an element of this argument at any intermediate point in stemming, that term will not be stemmed further. The default is c("allh","llh") because in most applications, stemming these common words for "God" creates some confusion by resulting in the string "lh". |
stemArabic prepares texts in Arabic for text analysis by stemming.
stemArabic returns a named list with the following elements:
text | The stemmed text. |
stemlist | A list of the stemmed words. |
Rich Nielsen
## generate some text in Arabic
x <- "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647 \u0627\u0644\u0631\u062D\u0645\u0646 \u0627\u0644\u0631\u062D\u064A\u0645"
## inspect
print(x)
## stem and transliterate
stemArabic(x)
## stem while not stemming certain words
stemArabic(x, dontStemTheseWords = c("alr7mn"))
## stem and return the stemlist
out <- stemArabic(x, returnStemList = TRUE)
out$text
out$stemlist
Transliterates Arabic Unicode characters into Latin characters using a transliteration system developed by Rich Nielsen.
transliterate(texts)
texts | A string in Arabic characters to be transliterated into Latin characters. |
Returns a string in Latin characters.
Rich Nielsen
## Create Arabic string
x <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627'
## Perform transliteration of Arabic into Latin characters.
transliterate(x)