By Camuel Gilyadov, on October 8th, 2010

Feature list of ultimate BigData analytics

  • Volume Scalability => the solution must handle high volumes of data, with cost scaling linearly across the range of 10GB – 10PB.
  • Latency Scalability => the solution must support both interactive and batch workloads, with cost scaling linearly across the latency range of 1 msec – 1 week.
  • Sophistication Scalability => the solution must support everything from simple summing scans to complex multi-way joins and statistics functionality, with cost scaling linearly across the range from simplistic scans to full-blown SQL:2008 / MDX / imperative in-database analytics / MapReduce. Report/index viewing is not considered analytics at all, and in particular not low-sophistication analytics; report/index creation is analytics and can vary in degree of sophistication. ETL systems are considered independent analytic systems.
  • Security => any unauthorized access to data must be prevented while, at the same time, in-place data analysis (such as predicate evaluation) remains possible and resource-efficient.
    • Keeping data always encrypted and keeping the keys always on the client will not work: it would require shipping all the data to the client, which is a non-starter for big data analytics. So compromises must be made. The issue is especially contentious in a public cloud setting.
    • If data is stored encrypted and is continuously decrypted in place for predicate evaluation, for example, then the keys must be kept in the same place (at least temporarily), which compromises the whole scheme and floors its cost-benefit factor. The cost of decryption is also quite high.
    • De-identification of all fields may work; random scaling may be applied to numeric fields with subsequent query/result rewrite (see the sketch after this list).
    • Security-by-obscurity methods and a defense-in-depth approach may have good cost-benefit factors, matching or exceeding the overall security of an in-house approach.
  • Cost => the solution must have a low TCO that scales linearly with dataset size and with the load generated by submitted queries. The breakdown (assuming cloud):
    • Storage component, linear in dataset size. Economies of scale must bring this cost down significantly; eventually it must be cheaper than on-site storage.
    • Computing component, linear in load, with infinite intra-query automatic elasticity. Guaranteed elasticity may bear a fixed premium proportional to the guaranteed capacity. Minor failures of cloud components must not restart long-running queries.
    • Bandwidth component. FedExing hard drives is by far the cheapest way to upload data, and query results are tiny; how much information can a human comprehend at once, after all?
  • Multi-form =>
    • normalized relational
    • star-schema
    • cubes
    • serialized objects / nested data
    • text
    • media
    • spatial
    • bio / scientific
    • topographical
    • and other data forms must be equally well supported and cross-queried.
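To make the de-identification point above more concrete, here is a minimal, hypothetical sketch of what "random scaling with query/result rewrite" could look like. The `Deidentifier` class and all names in it are illustrative assumptions, not part of any existing system: column names become opaque tokens, numeric values are multiplied by secret positive scale factors that never leave the client, and predicates and results are translated at the client boundary so an untrusted backend can still evaluate filters in place.

```python
# Hypothetical sketch of "random scaling + query rewrite" de-identification.
# Column names become opaque tokens; numeric values are multiplied by secret
# positive scale factors kept only on the client. De-identifying text values
# themselves (e.g. hashing/tokenization) is omitted here for brevity.

import random


class Deidentifier:
    def __init__(self, numeric_columns, text_columns):
        # The alias map and scale factors are the client-side secret.
        columns = list(numeric_columns) + list(text_columns)
        self.aliases = {c: f"col_{i}" for i, c in enumerate(columns)}
        self.scales = {c: random.uniform(0.5, 2.0) for c in numeric_columns}

    def encode_row(self, row):
        """Translate one raw record into its de-identified form."""
        out = {}
        for col, value in row.items():
            if col in self.scales:
                value = value * self.scales[col]
            out[self.aliases[col]] = value
        return out

    def rewrite_predicate(self, column, op, constant):
        """Rewrite a simple `column op constant` filter for the backend."""
        if column in self.scales:
            constant = constant * self.scales[column]
        return (self.aliases[column], op, constant)

    def decode_numeric(self, column, value):
        """Undo the scaling on a numeric value returned by the backend."""
        return value / self.scales[column]


# The backend only ever sees aliased columns and scaled constants; because the
# scale factors are positive, comparisons and linear aggregates (SUM, AVG)
# still evaluate correctly in the obfuscated space.
d = Deidentifier(numeric_columns=["salary"], text_columns=["dept"])
print(d.encode_row({"salary": 50000, "dept": "R&D"}))
print(d.rewrite_predicate("salary", ">", 50000))
print(d.decode_numeric("salary", d.rewrite_predicate("salary", ">", 50000)[2]))
```

Note the trade-off this sketch implies: only order and linear aggregates survive the rewrite, so anything beyond that (e.g. non-linear statistics) would need to be finished on the client or accepted as a leak of more information to the backend.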
