About Data Sampling

Sampling lets you load a Birst space with a controlled subset of data, so that you can develop much more quickly than with the full data set.

Sampling consists of two parts:

Generating samples
Applying sampling to sources

These two parts are kept separate so that you can easily turn sampling on and off in a space or resample with a different size. Note that sampling uses level keys to indicate which columns should be treated as the unique key to sample. A source does not have to be targeted to a level; the level key simply designates the set of columns that you want to sample.

Generating Samples

There are two methods to generate a sample:

Generating a raw sample based on a source
Generating a derived sample

Generating a Raw Sample Based on a Source

In this method, you type the following command:

createsample dimension level source percent

By indicating a dimension and level, you tell Birst which level key to use, and therefore which columns to sample. The source argument specifies which source file is used to generate the sample, and percent (a value from 1 to 100) sets the sampling percentage. When you issue this command, Birst processes that source, examines all rows, and creates a separate sample file containing a random sample of the level keys in that file.
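The idea of sampling level keys rather than rows can be sketched in Python. This is only an illustration of the concept under stated assumptions, not Birst's implementation: the function create_sample, its seed parameter, and the AccountID column are hypothetical names invented for the example.

```python
import random

def create_sample(rows, key_columns, percent, seed=None):
    """Return a random sample of the distinct level-key values found in rows.

    Hypothetical helper: illustrates the concept only, not Birst's code.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    # Examine every row and collect the distinct key tuples.
    keys = sorted({tuple(row[c] for c in key_columns) for row in rows})
    # Keep roughly `percent` percent of the keys (at least one if any exist).
    count = max(1, round(len(keys) * percent / 100)) if keys else 0
    return set(random.Random(seed).sample(keys, count))

# Assumed example data: 200 accounts keyed by a made-up AccountID column.
accounts = [{"AccountID": i, "Name": f"Account {i}"} for i in range(1, 201)]
account_sample = create_sample(accounts, ["AccountID"], percent=10, seed=42)
print(len(account_sample))  # 20 of the 200 distinct account keys
```

Because the sample holds keys rather than rows, the same sample can later be matched against any source that carries those key columns.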

Generating a Derived Sample

Because large spaces are sparse, simply taking a random sample of each source file would likely not yield many results, as the samples might not intersect. For example, in a CRM case where you have an “Account” source with a record for every customer account and an “Opportunities” source that lists individual sales opportunities, sampling these two files independently would likely give you opportunities in the sample for which there are no accounts. To solve this, you can create a derived sample of opportunities based on the sample of accounts. Since there is a one-to-many relationship between accounts and opportunities (each opportunity has one and only one account, but each account may have many opportunities), you would first use the createsample command with Account using the Account level key. Then you would derive an opportunity sample using another table (likely an opportunity source that includes account keys), for example:

derivesample Account Account Opportunity Opportunity OpportunitiesSource

The syntax is:

derivesample source_dimension source_level target_dimension target_level source
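The logic behind a derived sample can be sketched in Python as follows. This is a conceptual illustration only, not how Birst implements derivesample: the function derive_sample and the AccountID and OpportunityID columns are hypothetical.

```python
def derive_sample(parent_sample, rows, parent_key_columns, child_key_columns):
    """Collect the child keys of rows whose parent key is in parent_sample.

    Hypothetical helper: illustrates the concept only, not Birst's code.
    """
    derived = set()
    for row in rows:
        parent_key = tuple(row[c] for c in parent_key_columns)
        if parent_key in parent_sample:
            derived.add(tuple(row[c] for c in child_key_columns))
    return derived

# Assumed example data: a sample of account keys and an opportunity
# source that carries the account key on every row.
account_sample = {(1,), (3,)}
opportunities = [
    {"OpportunityID": 100, "AccountID": 1},
    {"OpportunityID": 101, "AccountID": 1},
    {"OpportunityID": 102, "AccountID": 2},
    {"OpportunityID": 103, "AccountID": 3},
]
opportunity_sample = derive_sample(
    account_sample, opportunities, ["AccountID"], ["OpportunityID"]
)
print(sorted(opportunity_sample))  # [(100,), (101,), (103,)]
```

Every opportunity in the derived sample belongs to a sampled account, so the two samples are guaranteed to intersect.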

Applying Samples to Sources

Once samples are generated, they need to be applied to source files so that they are used during processing. Each sample is associated with a level key. You can see which samples have been generated by using the showsamples command.

To apply a sample to a source, use the command:

applysample dimension level source

This works because each sample is associated uniquely with a level key. Do this for each source that requires sampling. You can see which samples are applied using the showsamples command. You can reverse this action by using the removesample command, for example:

removesample dimension level source

Note: The removesample command only tells Birst not to use sampling on a source; it does not delete any sample files.

Once samples are associated with sources, you can process data. The first time data is processed with sampling, Birst uses the full source files to generate sampled sources. These sampled source files are then saved so that you can re-process the data later without using the original files. This lets you iterate on processing very rapidly.
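The "filter once, then reuse" behavior described above can be sketched in Python. This is an assumption-laden illustration, not Birst's implementation: process_source, load_accounts, and the in-memory cache dictionary are hypothetical stand-ins for the saved sampled source files.

```python
def process_source(name, load_full_rows, key_columns, sample, cache):
    """Return the sampled rows, reading the full source only on the first call.

    Hypothetical helper: `cache` stands in for the saved sampled source files.
    """
    if name not in cache:
        # First run: read the full source and keep only the sampled keys.
        cache[name] = [
            row for row in load_full_rows()
            if tuple(row[c] for c in key_columns) in sample
        ]
    # Later runs: reuse the saved sampled source.
    return cache[name]

reads = []  # records every read of the full source

def load_accounts():
    reads.append("Accounts")
    return [{"AccountID": i} for i in range(1, 6)]

cache = {}
sample = {(2,), (4,)}
first = process_source("Accounts", load_accounts, ["AccountID"], sample, cache)
second = process_source("Accounts", load_accounts, ["AccountID"], sample, cache)
print(first)       # [{'AccountID': 2}, {'AccountID': 4}]
print(len(reads))  # 1
```

The second call returns the same rows without touching the full source, which is why repeated processing runs are fast.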

When you are ready to process the full data, use the deletesamples command. This command deletes the individual sample files for each level key so that processing uses the full source files. Birst still remembers the associations that you previously defined, so if you want to sample again later, you can simply generate the sample files without having to reapply them to sources. In this way, sampling can be quickly turned on or off. If you want to increase the sample size (say, from a 2% sample to a 10% sample), use the deletesamples command and resample with the larger percentage.
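The separation between sample files and sample associations can be modeled with a small Python class. This is a simplified mental model of the behavior described in this section, not Birst's internals: the class SamplingState, its method names, and the AccountSource name are hypothetical.

```python
class SamplingState:
    """Toy model of a space: sample files and associations are tracked separately."""

    def __init__(self):
        self.sample_files = {}      # (dimension, level) -> set of sampled keys
        self.associations = set()   # (dimension, level, source)

    def create_sample(self, dimension, level, keys):
        self.sample_files[(dimension, level)] = set(keys)

    def apply_sample(self, dimension, level, source):
        self.associations.add((dimension, level, source))

    def remove_sample(self, dimension, level, source):
        # Drops the association only; the sample file stays.
        self.associations.discard((dimension, level, source))

    def delete_samples(self):
        # Deletes the sample files only; the associations stay.
        self.sample_files.clear()

    def is_sampled(self, dimension, level, source):
        return ((dimension, level, source) in self.associations
                and (dimension, level) in self.sample_files)

space = SamplingState()
space.create_sample("Account", "Account", {1, 3})
space.apply_sample("Account", "Account", "AccountSource")
print(space.is_sampled("Account", "Account", "AccountSource"))  # True

space.delete_samples()  # process the full data
print(space.is_sampled("Account", "Account", "AccountSource"))  # False

space.create_sample("Account", "Account", {1, 2, 3, 4})  # resample, larger size
print(space.is_sampled("Account", "Account", "AccountSource"))  # True
```

In this model, regenerating the sample is enough to turn sampling back on, because the association with the source was never removed.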