A Novel Feature Hashing for Text Mining


ŞEKER Ş. E. , Mert C.

International Black Sea University Journal of Technical Science & Technologies, vol. 2, pp. 37-40, 2013 (Peer-Reviewed University Journal)

  • Volume: 2 Issue: 1
  • Publication Date: 2013
  • Journal Name: International Black Sea University Journal of Technical Science & Technologies
  • Pages: pp. 37-40

Abstract

 

With the growing number of studies on big data that use text as the data source, feature hashing now plays a major role in the literature. Text mining on big data usually requires a feature hashing layer, which reduces the size of the feature vector. For example, counting words yields hundreds of thousands of features in most cases, whereas POS tagging reduces this number to about 50 features. With feature hashing, the feature vector shrinks to a manageable size, and data mining processes such as classification, clustering, or association can run faster. In some cases, certain algorithms cannot be executed at all on current hardware, and parallel or distributed programming has to be taken into account.

Feature hashing approaches can usually be categorized into two groups. The first group relies on natural language processing (NLP) algorithms and tries to extract relatively smarter hash results that represent the characteristics of the input as fully as possible. The second group consists of mathematical hashing algorithms, which do not deal with the context or meaning of the text input and simply process the input into some binary output. For example, POS-tagging approaches carry some features of the input over to the output, whereas hashing algorithms such as MD5 or SHA-1 preserve nothing of the input's meaning; they are concerned only with minimizing collisions in the output.
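The second group can be illustrated with a minimal sketch of hashing tokens into a fixed-length count vector. This is only an illustration under stated assumptions, not the procedure evaluated in the paper: the function name `hashed_features`, the bucket count of 16, and the sample sentence are all hypothetical, and MD5 is used simply because the abstract names it.

```python
import hashlib


def hashed_features(tokens, n_buckets=16):
    """Map tokens into a fixed-length count vector (hypothetical helper).

    Each token is hashed with MD5 and the digest is reduced modulo
    n_buckets, so the vector length no longer depends on vocabulary size.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as an unsigned integer.
        idx = int.from_bytes(digest[:4], "big") % n_buckets
        vec[idx] += 1
    return vec


tokens = "the quick brown fox jumps over the lazy dog".split()
vec = hashed_features(tokens)
print(len(vec), sum(vec))
```

The vector length stays at `n_buckets` however many distinct words appear. The price is that unrelated tokens may collide in the same bucket, and the bucket index carries no linguistic meaning.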

This study focuses on the second group of hashing algorithms and criticizes the hashing algorithms based on Feistel networks, which are widely used in text mining studies. We propose a new approach built mainly on substitution boxes (s-boxes), which are at the core of all Feistel networks; the approach processes text faster than the other implementations.
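The abstract does not give the details of the proposed algorithm, so the following is not the authors' method. As a rough sketch of what hashing with a single substitution table can look like, the example below swaps in Pearson-style hashing, in which each input byte drives one lookup in a 256-entry permutation table. The names `make_sbox`, `sbox_hash`, and `sbox_features`, the seed, and the bucket count are assumptions made for illustration.

```python
import random


def make_sbox(seed=0):
    """Build a 256-entry substitution table: a seeded permutation of 0..255."""
    rng = random.Random(seed)
    sbox = list(range(256))
    rng.shuffle(sbox)
    return sbox


SBOX = make_sbox()


def sbox_hash(text, sbox=SBOX):
    """Pearson-style 8-bit hash: one table lookup per input byte."""
    h = 0
    for b in text.encode("utf-8"):
        h = sbox[h ^ b]
    return h


def sbox_features(tokens, n_buckets=16):
    """Count tokens per bucket, using sbox_hash to choose the bucket."""
    vec = [0] * n_buckets
    for tok in tokens:
        vec[sbox_hash(tok) % n_buckets] += 1
    return vec


sample = "feature hashing for text mining".split()
print(sbox_features(sample))
```

Each byte costs one XOR and one table lookup, with no multi-round structure, which gives an intuition for why a table-driven design can be cheaper per token than a full cryptographic hash. Whether that matches the speed-up reported in the paper cannot be determined from the abstract alone.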

