Monday, October 13, 2008

Sir, a sample please?



If we ever had unlimited money, processing power, complete coverage and no constraints at all, we could ignore the idea of sampling altogether, since sampling would just be double work when we could work on the population itself.

Back down to earth, every business decision involves trade-offs between constraints such as budget and time. When deciding on business direction, strategy or even a daily routine task, it is imperative to understand the nature of the problem before formulating a proper solution. Understanding the problem in context usually involves actions such as monitoring certain characteristics of items or obtaining input from people. If we can gather data on every single item of interest, then we are working directly on the population instead of sampling part of it. Some problems mandate sampling because it is impractical, if not impossible, to deal with the entire population. For example, to find the average height and weight of Asian adult males, it is time-consuming and probably impossible to capture the height of every matching candidate. Some of them may be hiding somewhere in a jungle, so your data will be incomplete. In other words, at best you are only approximating the population, i.e. sampling. Other sampling scenarios include production process control mechanisms that check product characteristics at random intervals, market research surveys that target certain strata in some geographical locations, and anthropological studies.
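To make the height example concrete, here is a minimal Python sketch. The "population" of heights is synthetic and made up purely for illustration; the point is only that a simple random sample of a few hundred values already gives a usable estimate of a mean that would be costly to measure exhaustively.

import random
import statistics

random.seed(42)

# Pretend this is the full population we could never measure exhaustively
# (synthetic heights in cm, roughly normal around 170).
population_heights = [random.gauss(170, 7) for _ in range(100_000)]

# Draw a simple random sample instead of measuring everyone.
sample = random.sample(population_heights, k=500)

print(f"sample mean:     {statistics.mean(sample):.2f} cm")
print(f"population mean: {statistics.mean(population_heights):.2f} cm")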

We now know that a sample is a subset or part of a population, and that the sampling process is basically drawing that part from the population. With a representative, unbiased and sufficiently large sample, we can draw (i.e. infer) conclusions about the underlying population.

From the statement above, it is obvious that the conclusion might be misleading or outright wrong if the sample is non-representative, biased or insufficient.

That is why sampling techniques more elaborate than simple random sampling exist to reduce some of these effects, e.g. stratified sampling, cluster sampling and multistage sampling. For now, just remember that all of these techniques are simply different ways to draw items from the population.
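As a rough illustration of one of those techniques, here is a small Python sketch of proportional stratified sampling. The "region" strata, the population size and the 5% sampling fraction are all invented for the example.

import random
from collections import defaultdict

random.seed(1)

# Made-up population records, each tagged with a stratum ("region").
population = [
    {"region": random.choice(["north", "south", "east"]),
     "height": random.gauss(170, 7)}
    for _ in range(10_000)
]

# Group by stratum, then sample proportionally from each group.
strata = defaultdict(list)
for person in population:
    strata[person["region"]].append(person)

sample_fraction = 0.05  # arbitrary 5% for illustration
stratified_sample = []
for region, members in strata.items():
    k = max(1, round(len(members) * sample_fraction))
    stratified_sample.extend(random.sample(members, k))

print("total sample size:", len(stratified_sample))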

Once you have a good sample, the next logical steps include organizing, describing and summarizing it, both quantitatively and graphically. The important milestone in this step is to get a grasp of the sample's probability distribution, i.e. the pattern of random variation in the sample. From the sample's distribution you can infer the population's probability distribution. Only after confirming the distribution can you be sure that your choice of analysis tools is appropriate and consistent.
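Here is a rough Python sketch of that describe-and-summarize step, again on made-up data, with a crude text histogram just to hint at the sample's shape; it is not a substitute for a proper plot or a goodness-of-fit check.

import random
import statistics

random.seed(7)
sample = [random.gauss(170, 7) for _ in range(500)]  # made-up heights

# Numerical summary of the sample.
print("mean  :", round(statistics.mean(sample), 2))
print("median:", round(statistics.median(sample), 2))
print("stdev :", round(statistics.stdev(sample), 2))

# Crude text histogram: bucket the values and print a bar per bucket.
low, high, bins = min(sample), max(sample), 10
width = (high - low) / bins
counts = [0] * bins
for x in sample:
    counts[min(int((x - low) / width), bins - 1)] += 1
for i, count in enumerate(counts):
    print(f"{low + i * width:6.1f} | {'#' * (count // 5)}")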

In formal terms, statisticians use the word "statistic" for a numerical characteristic of a sample and the word "parameter" for the corresponding characteristic of a population. The symbols differ too, e.g. a lowercase s for the sample standard deviation and the lowercase Greek letter sigma for the population standard deviation.
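A tiny sketch of that distinction, with synthetic data: the sample statistic s is computed with n - 1 in the denominator, while the population parameter sigma divides by N.

import random
import statistics

random.seed(3)
population = [random.gauss(0, 10) for _ in range(100_000)]
sample = random.sample(population, k=200)

s = statistics.stdev(sample)           # sample statistic, divides by n - 1
sigma = statistics.pstdev(population)  # population parameter, divides by N

print(f"s     (sample statistic):     {s:.3f}")
print(f"sigma (population parameter): {sigma:.3f}")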

Some common probability distributions include the hypergeometric, binomial and Poisson for discrete random variables, and the exponential and normal for continuous random variables, among others.
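To tie those names to numbers, here is a short sketch that draws a few values from each of those distributions. It assumes NumPy is available, and the parameters (n, p, lambda and so on) are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Five draws from each distribution; all parameters are arbitrary.
draws = {
    "hypergeometric": rng.hypergeometric(ngood=7, nbad=13, nsample=5, size=5),
    "binomial":       rng.binomial(n=10, p=0.3, size=5),
    "poisson":        rng.poisson(lam=4.0, size=5),
    "exponential":    rng.exponential(scale=2.0, size=5),
    "normal":         rng.normal(loc=0.0, scale=1.0, size=5),
}

for name, values in draws.items():
    print(f"{name:15s} {np.round(values, 3)}")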

Well, all of the above is fundamental knowledge; I'm just putting it down in words. Easy, right?

