How To Eliminate Noise Problem In Data Analysis
How to manage noisy data
Noisy data is data that has a relatively low signal-to-noise ratio. When data is collected, humans tend to make mistakes and instruments tend to be inaccurate, so the collected data carries some error in it. This error is referred to as noise.
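To make the idea concrete, here is a minimal sketch (using NumPy, and taking a variance ratio as one common working definition of the signal-to-noise ratio) that simulates a clean signal corrupted by measurement noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# A clean underlying signal plus random measurement error
signal = np.sin(np.linspace(0, 4 * np.pi, 500))    # the true pattern
noise = rng.normal(scale=0.5, size=signal.shape)   # measurement error
measured = signal + noise                          # what we actually observe

# Signal-to-noise ratio as a ratio of variances (one common definition)
snr = signal.var() / noise.var()
print(f"SNR = {snr:.2f}")  # values near or below 1 mean the noise rivals the signal
```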
Noise creates problems for machine learning algorithms because, if not trained properly, an algorithm can mistake noise for a pattern and start generalizing from it, which is of course undesirable. We ideally want the algorithm to make sense of the data and generalize its underlying properties. Noise prevents this by tending to "fool" the algorithm into making incorrect generalizations. It is therefore important for any data scientist to take care of noise when applying a machine learning algorithm to noisy data.
In order to manage noisy data, here are some techniques that are extensively used:
Collecting more data
The simplest way to handle noisy data is to collect more data. The more data you collect, the better you will be able to identify the underlying phenomenon that is generating the data. This eventually helps in reducing the effect of noise. Think about it: when survey companies conduct surveys, they do it on a mass scale. This is because a handful of survey responses might not be good for generalizing; humans tend to be moody, so some may answer the survey negatively simply because of a bad mood (noisy data). This may not reflect the actual behavior of the masses unless the survey is conducted on a really large scale.
As a rule of thumb: the larger the sample size, the better you will be able to uncover the actual behavior of the population.
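As a rough illustration of this rule of thumb, the sketch below (with made-up numbers, assuming independent responses) simulates a noisy survey and shows how the estimate of the true average stabilizes as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(42)

true_satisfaction = 7.0   # the underlying "real" average score out of 10
noise_std = 2.0           # mood swings and other random error in responses

for n in (10, 100, 1_000, 10_000):
    responses = true_satisfaction + rng.normal(scale=noise_std, size=n)
    # The error of the estimated mean shrinks roughly as 1/sqrt(n)
    print(f"n={n:>6}: estimated mean = {responses.mean():.2f}")
```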
PCA
Principal Component Analysis (PCA) is a method from the family of data analysis and, more generally, multivariate statistics, which consists of transforming linked variables (called "correlated" in statistics) into new variables that are uncorrelated with each other. These new variables are called "principal components", or principal axes. It allows the practitioner to reduce the number of variables and make the data less redundant.
It is an approach that is both geometric (the variables are represented in a new space, along directions of maximum inertia) and statistical (the search for independent axes that best explain the variability, i.e. the variance, of the data). When you want to compress a set of random variables, the principal axes of principal component analysis are a better choice from the point of view of inertia or variance.
PCA effectively reduces the dimension of the input data by projecting it along various axes. For example, consider projecting a point in the X-Y plane onto the X-axis. This way, we are able to remove the (possibly) noisy dimension, the Y-axis. This process is also referred to as "dimensionality reduction". PCA is therefore widely used to reduce noise in data by "forgetting" the axes that contain the noisy data.
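As a small sketch of this idea, the example below (using scikit-learn's PCA on made-up 2-D data where most of the real variation lies along one direction) keeps only the leading component and reconstructs the points, discarding the low-variance, mostly-noisy axis:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 2-D points: strong variation along one direction, small noise along the other
t = rng.normal(size=(200, 1))
X = np.hstack([t, 0.1 * rng.normal(size=(200, 1))])

pca = PCA(n_components=1)            # keep only the dominant (high-variance) axis
X_reduced = pca.fit_transform(X)     # project onto the principal component
X_denoised = pca.inverse_transform(X_reduced)  # map back; the noisy axis is "forgotten"

print("explained variance ratio:", pca.explained_variance_ratio_)
```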
Regularization
The core of a machine learning algorithm is the ability to learn and generalize from the dataset that the algorithm has seen. However, if the algorithm is given enough flexibility (more parameters), then it may happen that the algorithm "overfits" the noisy data. This means that the algorithm is fooled into believing that the noise component of the data also represents a pattern. In order to avoid that, one commonly used technique is called regularization. In regularization, a penalty term is added to the algorithm's cost function, which represents the size of the weights (parameters) of the algorithm. This ensures that, in order to minimize the cost, the weights are kept smaller, thereby leading to less freedom for the algorithm. This greatly helps in avoiding overfitting. There are 2 commonly used techniques in regularization:
- L1 regularization: In L1 regularization, a term of |w_i| is added for each weight w_i. The modulus function is always positive, so the regularization term leads to an increase in the cost function.
- L2 regularization: In L2 regularization, a term of w_i^2 is added for each weight. Since the square is always positive, here too the regularization term leads to an increase in the cost function.
In order to minimize the cost, the optimizer tries to keep the values of the weights low, thereby leading to less flexibility of the algorithm. This avoids overfitting, which in turn helps in handling noisy data more easily.
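A minimal sketch of how this looks in practice (using scikit-learn's Lasso for L1 and Ridge for L2 on made-up noisy data; the alpha values are arbitrary choices) shows the penalty shrinking the weights:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Noisy linear data: y really depends on only a few of the 20 features
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 0.5]
y = X @ true_w + rng.normal(scale=1.0, size=100)

plain = LinearRegression().fit(X, y)
l1 = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: adds sum(|w_i|) to the cost
l2 = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: adds sum(w_i^2) to the cost

# Regularized weights are smaller overall (L1 often drives many exactly to zero)
print("unregularized |w| sum:", np.abs(plain.coef_).sum().round(2))
print("L1-regularized |w| sum:", np.abs(l1.coef_).sum().round(2))
print("L2-regularized |w| sum:", np.abs(l2.coef_).sum().round(2))
```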
Cross Validation
Cross-validation is a technique that helps in tackling noisy data by preventing overfitting, much like regularization. In cross-validation, the dataset is broken into 3 sets (rather than 2):
- Training data
- Cross validation data
- Testing data
The algorithm is trained using the training data. However, the hyperparameters are tuned using the cross-validation data, which is separate from the training data. This makes sure that the algorithm is able to avoid learning the noise present in the training data and instead generalizes through the cross-validation procedure. Finally, the fresh test data can be used to evaluate how well the algorithm was able to generalize.
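Here is a minimal sketch of this three-way split (using scikit-learn; the split proportions, the Ridge model, and the hyperparameter grid are illustrative assumptions, not a prescription):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=1.0, size=300)   # noisy target

# Split into training, cross-validation, and test sets (roughly 60/20/20)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune the hyperparameter (here: the regularization strength) on the CV set only
best_alpha, best_err = None, float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    err = mean_squared_error(y_cv, model.predict(X_cv))
    if err < best_err:
        best_alpha, best_err = alpha, err

# Evaluate generalization on the untouched test set
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("best alpha:", best_alpha)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))
```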
It is important for all data scientists to understand the impact that noise can have on the data, and every data scientist must take appropriate measures to design algorithms accordingly. This way, the generalizing capabilities of the algorithm on new data will be far better.