By Camuel Gilyadov, on October 8th, 2010
Feature list of ultimate BigData analytics
- Volume Scalability => the solution must handle high volumes of data, meaning the cost must scale linearly in the range of 10GB – 10PB.
- Latency Scalability => the solution must serve both interactive and batch workloads, and the cost must scale linearly across latencies in the range of 1 msec – 1 week.
- Sophistication Scalability => the solution must support anything from simple summing scans to complex multi-way joins and statistical functions, and the cost must scale linearly across the range from simplistic scans to full-blown SQL:2008/MDX/imperative in-database analytics/MapReduce. Report/index viewing is not considered analytics at all, not even low-sophistication analytics. Report/index creation is analytics and can vary in sophistication. ETL systems are considered independent analytic systems.
- Security => any unauthorized access to data must be prevented and, at the same time, in-place data analysis (such as predicate evaluation) must be possible and resource-efficient.
  - Keeping data always encrypted and keeping the keys always on the client will not work. It would require shipping all the data to the client, which is a non-starter for big data analytics. So compromises must be made. The issue is especially contentious in a public cloud setting.
  - If data is stored encrypted and is continuously decrypted in place for predicate evaluation, for example, the keys must be kept in the same place (at least temporarily), which compromises the whole scheme and floors its cost-benefit factor. The cost of decryption is also pretty high.
  - De-identification of all fields may work; random scaling may be applied to numeric fields with subsequent query/result rewrite.
  - Security-by-obscurity methods and a defense-in-depth approach may have good cost-benefit factors, matching or exceeding the overall security of an in-house approach.
- Cost => must have a low TCO that scales linearly with dataset size and with the load generated by submitted queries. The breakdown (assuming cloud):
  - Storage component: linear in dataset size. Economies of scale must bring this cost down significantly. Eventually it must be cheaper than on-site storage.
  - Computing component: linear in load, with infinite automatic intra-query elasticity. Guaranteed elasticity may carry a fixed premium proportional to the guaranteed capacity. Minor failures of cloud components must not restart long-running queries.
  - Bandwidth component: FedExing hard drives is by far the cheapest way to upload data, and query results are really small. How much information can a human comprehend at once, after all?
- Multi-form =>
  - normalized relational
  - star-schema
  - cubes
  - serialized objects / nested data
  - text
  - media
  - spatial
  - bio / scientific
  - topographical
  - and other data forms must be equally well supported and cross-queried.
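
To make the "random scaling with subsequent query/result rewrite" idea from the Security item more concrete, here is a minimal sketch. It assumes a single numeric field stored as integers (say, cents), a secret positive integer scale factor that never leaves the client, and one query shape: sum of all values above a threshold. All function names are hypothetical and only illustrate the flow. This is an obscurity measure in the sense discussed above, not encryption.

```python
import random


def make_scale_key(rng=None):
    """Pick a secret positive integer scale factor (hypothetical helper).

    The key stays on the client. A positive factor preserves ordering,
    so comparison predicates still work on the scaled data.
    """
    rng = rng or random.Random()
    return rng.randrange(2, 10**6)


def deidentify(values, key):
    """Scale integer values before they are uploaded."""
    return [v * key for v in values]


def rewrite_threshold(threshold, key):
    """Query rewrite: scale the constant in a predicate like `amount > threshold`."""
    return threshold * key


def server_sum_where_greater(scaled_values, scaled_threshold):
    """Stand-in for the remote engine, which only ever sees scaled numbers."""
    return sum(v for v in scaled_values if v > scaled_threshold)


def rewrite_result(scaled_sum, key):
    """Result rewrite: undo the scaling on the client (exact for integers)."""
    return scaled_sum // key


if __name__ == "__main__":
    amounts = [12000, 8000, 30000, 4500]      # real values, in cents
    key = make_scale_key()
    uploaded = deidentify(amounts, key)       # what the cloud stores
    scaled = server_sum_where_greater(uploaded, rewrite_threshold(10000, key))
    print(rewrite_result(scaled, key))        # sum of amounts above 10000
```

Integers keep the rewrite exact; with floating-point fields the same scheme would need a tolerance when comparing values close to the threshold. Queries that are not linear in the field (products of two scaled fields, for example) would need their own rewrite rules.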