Wednesday, 8 June 2016

Three distinct trends in today’s predictive analytics process

1. Predictive hypothesis testing
As businesses begin to ask more strategic questions about what’s really driving their performance, executives are understandably demanding proof before executing on this new era of insight. Specifically, businesses want to understand the cause and effect of various sets of data, and know that this analysis can be extrapolated over different periods of time. This new form of analysis, powered by machine learning, is critical to businesses looking to gain a competitive edge.

2. Closing the gap between data and delivery
The hunger for Big Data doesn’t end with merely gathering the right kind of data. From the thousands of data sets available, executives who are looking to better leverage data to solve complex problems need more streamlined ways to glean insight from the variety of data being collected. Currently, companies often deal with this by creating their own analytics platforms, which is very expensive and doesn’t support all facets of the predictive analytics process. Companies are looking for easier ways to close the data gap and are turning toward more streamlined cloud computing in order to speed up the time to insights.
3. Shrinking the barrier between internal and external data
While internal systems have been consolidated for years, the influx of external data is creating unforeseen silos within businesses. In order to increase efficiency and streamline workflows, companies have implemented web-based data transactions, which have created a great divide between internal data and external data. Companies that are spending thousands on a tool to gather a wide variety of external data sets are now faced with needing to spend even more on a solution to combine this data into their internal workflows. As such, executives are demanding easier, quicker and less expensive ways to shrink this barrier.
Embracing the new predictive analytics process
The diversity of data sets and the sheer amount of external data continue to grow with the speed of technology. Global companies certainly recognize its power, but only now are they beginning to find ways to glean real business value from the insights this data can provide.
From implementing more seamless processes for gathering and correlating external data to finding flexible solutions that enable hypothesis testing and analysis of various data sets, companies looking to fully leverage Big Data to solve big problems must embrace this new era of predictive analytics. By incorporating these three trends into business processes, companies can truly see what’s driving their performance and, ultimately, stay ahead of global competition.

Source:

A New Era of Predictive Analytics: 2016 Trends to Watch by Rich Wagner

Get a glimpse of free Hadoop on-demand training from MapR: learn "Spark Essentials"

Excellent Python Tutorials Sources for you to use

1. Code Academy – Interactive, Beginners

There are many interactive tutorials available for Python that let you write code in the browser and see the results live, right there. That is what makes learning fun! Code Academy hosts the best interactive Python tutorials for beginners.
As of today, there are 2.5m students enrolled for this course. The course length is 13 hours, approximately.
It covers Python syntax, strings and console output, conditionals and control flow, functions, lists and dictionaries, loops, and file input and output, and also touches on advanced Python topics. You get to build small projects as you learn, and step-by-step instructions make coding these projects easy, right there in the browser.
A couple of the example projects that you build while learning Python on Code Academy are a “Tip calculator” and a small board game named “Battleship”.

2. TutorialsPoint.com – Beginners, Online

Want to learn Python from scratch? Tutorialspoint.com hosts one of the most comprehensive tutorials for learning Python basics and fundamental concepts. Anyone who is totally new to programming can also start learning with Tutorials Point’s Python tutorials. It starts by giving a high-level overview of Python, then covers environment setup, basic syntax, variable types, operators, decision making, loops and so on, before moving into the depths of the language.
Tutorials Point also talks about advanced concepts like CGI programming, database access, multithreading, XML processing, GUI programming and Networking etc.

3. Codementor.io – Advanced, For Experts, Online

Codementor is a perfect place to find advanced tutorials if you are an expert developer. It is not a step-by-step Python learning guide; instead, it offers tutorials for completing specific development tasks using Python. The website adds new tutorials on a regular basis, and you can keep up to date with the latest ones by signing up for its newsletter.
Given below are a few examples of the kinds of Python tutorials you will find on Codementor.io:
  •  Building a movie recommendation service using Apache Spark and Flask – In Python
  •  Sorting Git Authors in less than 10 lines of code, of course, using Python script
  •  Data Science with Python & R: Sentiment Classification Using Linear Methods
  •  Integrating Node.js & Python to Write Cross-Language Modules using pyExecJs
  •  Advanced Uses of Python Decorators

4. PythonChallenge.com – Advanced, Interactive

If you love challenges and also want to learn the depths of the Python programming language, there is no better resource on the internet than PythonChallenge.com. This one, of course, is not for absolute beginners, nor for those who just take notes in the classroom for overnight cramming without exercising their brains much.
Visit this website, if you want to keep your creative juices flowing. There are 33 levels at the moment and the very first one itself has the potential to get you engaged and addicted to the website.

5. Google’s Python Class – Free eBook, Intermediate

The Python tutorials from Google developers are well written and cleanly organized. They are all about theory, though, without practical step-by-step instructions for building projects. They are best suited to reading on the go and to those who already have some basic programming background.
This entire Python tutorial set is organized in three sections –
Python Course – This section talks about Python setup, basics of Python like strings, lists, sorting, regular expressions, utilities etc. and feeds you with the fundamentals of the language.
Lecture videos day1, day2 – Not the reading kind? No problem: you can go through the video lectures, divided into day1 and day2, and grasp the fundamentals of the Python language.
Python Exercises – This is what makes learning Python interesting. There are basic exercises, an exercise around baby names, one for a “copy special” feature and one on the log puzzle. You will need to put in some thought to understand and complete these exercises.
You can download the Python code used in the exercises and run it locally on your machine. You can also ask questions to clarify your doubts in Google Groups.

6. Python.org – Online, Beginners Python Tutorial

This is the official Python guide and is best suited for those who need a comprehensive tour of the Python language. The official Python documentation is a complete reference to the language and is always updated with the latest features and release notes.
It is always good to skim through the official guide at least once to ensure that you are not missing out on anything basic. The official Python guide covers what is new in Python, Python installation guides, library references, Python how-tos, and embedding, extending and distributing Python modules.

7. Learn Python the Hard Way – eBook, Beginners

Learn Python the Hard Way is one of the sure-shot ways to get on board with Python programming. As per the official website, 1.5 million people read this eBook every year, and the book is the most successful beginner programming eBook on the market as of today.
The paper and digital versions of the book come at a cost, but you can read the online version of the complete book for free. It is arguably the best Python tutorial out there in the wild.

8. LearnPython.org – Interactive Python Programming, Intermediate

This website offers interactive Python tutorials for mastering the syntax of the Python programming language. The current interpreter runs Python 2, but the tutorial highlights key differences between Python 2 and Python 3 programming.
The tutorial starts off with Hello world, then explains variables and types, lists, basic operators, string formatting, basic string operations, loops, functions, classes and objects, dictionaries, and modules and packages. You also get exercises at the end of each chapter to help you work through the depths of Python web programming.

9. Invent with Python – Free Book, Online

A colleague of mine who used to teach students found that learning programming by building games is what keeps students engaged for hours. Invent with Python takes the same approach: each chapter has step-by-step instructions to build a small game.
As you keep learning, the complexity of games keeps on increasing and learning becomes more fun. The online version of the book is free but you can also download the pdf version of the book at a nominal price.

10. Dive into Python 3 – Beginners, Online

This is a good read for beginners as well as for those who already know Python 2 and want to move to Python 3. The author clearly highlights differences between Python 2 and Python 3, wherever applicable, and ensures that readers grasp the concepts by citing relevant examples.
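To give a flavour of the kind of differences such a guide highlights, here is a minimal sketch written for Python 3, with comments noting what Python 2 did instead. The function name `differences` is made up for illustration and is not taken from the book.

```python
def differences():
    """Collect a few well-known Python 3 behaviours that differ from Python 2."""
    results = {}
    # In Python 3, / is true division; in Python 2, 7 / 2 gave 3 for ints.
    results["true_division"] = 7 / 2
    # Floor division behaves the same in both versions.
    results["floor_division"] = 7 // 2
    # In Python 3, str is Unicode text and bytes is a separate type.
    results["text_type"] = type("abc").__name__
    results["bytes_type"] = type(b"abc").__name__
    # In Python 3, range() is a lazy sequence; in Python 2 it returned a list.
    results["range_is_list"] = isinstance(range(3), list)
    return results


if __name__ == "__main__":
    # print is a function in Python 3; in Python 2 it was a statement.
    for name, value in differences().items():
        print(name, "=", value)
```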

11. Python Crash Course - Intermediate, Online

If you are already a programmer and want to quickly get on-boarded with Python, this is the right place for you. This course is meant for intermediate level programmers and assumes that you already understand object oriented programming.
The aim of the course is not to go into the depths of programming but to highlight what Python brings to the table and how you can code in Python if you already know programming.

12. Learning Python Magic Methods – Advanced

This is a collection of tutorials/blogs by Rafe Kettler and is intended for advanced Python programmers. Magic methods are central to object-oriented programming in Python, but the official documentation does not seem to explain them well enough. Rafe Kettler has tried to explain the ins and outs of magic methods using good examples. A must-read for anyone looking to master the magic methods of Python.
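For readers who have not met magic methods yet, the following is a minimal sketch of the idea, not an excerpt from the guide. The `Vector` class is a made-up example; the double-underscore method names are the standard Python ones.

```python
class Vector:
    """A tiny 2-D vector used only to illustrate a few magic methods."""

    def __init__(self, x, y):
        # Called when Vector(x, y) is constructed.
        self.x = x
        self.y = y

    def __repr__(self):
        # Called by repr() and by the interactive prompt.
        return f"Vector({self.x}, {self.y})"

    def __eq__(self, other):
        # Called for the == operator.
        if not isinstance(other, Vector):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __add__(self, other):
        # Called for the + operator.
        if not isinstance(other, Vector):
            return NotImplemented
        return Vector(self.x + other.x, self.y + other.y)

    def __len__(self):
        # Called by len(); a 2-D vector always has two components.
        return 2

    def __getitem__(self, index):
        # Called for v[index]; also makes the object iterable.
        return (self.x, self.y)[index]
```

With these methods defined, `Vector(1, 2) + Vector(3, 4)` evaluates to `Vector(4, 6)`, and `list(Vector(5, 6))` gives `[5, 6]`, without calling any of the methods by name.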

13. Afterhoursprogramming.com – Online, Beginners

Here is another well-written tutorial for beginners, with a code simulator to test the code in the browser. In this tutorial, you not only learn the fundamentals of Python but also learn how to build interactive web applications.
Towards the end, this tutorial hosts a Python quiz to test your knowledge. The quiz is pretty useful for checking where you stand in terms of Python basics.

14. Python Basic Tutorials – Beginners, Video

If you want to learn the way people learn in a classroom, video tutorials are the way to go. This one is a series of Python video tutorials by theNewBoston. You get end-to-end coverage of Python by following these video tutorials.

15. Python Fundamentals Training – Beginners, Video

This is much like the previous one but goes a little beyond the basics of coding in Python. It is a full four-day training course that teaches Python fundamentals through videos. These Python fundamentals training videos are brought to you by NewCircle.com.

16. A Byte of Python - Free Online, PDF

A Byte of Python is for absolute novices in the world of computers and programming. It is written by Swaroop and is loved by beginners all over the globe. The language used is simple and the contents are organized neatly. You can read the book online or download the Python tutorial as a PDF.

17. Coursera Python Course

Coursera hosts online classes from top-notch universities, including Python courses provided by Rice University. As of today, there is a two-part course available, spread across many weeks. You can check the schedule and register for the online classes, which are free of cost.
If, however, you need a Python certificate after completing the courses, you need to pay a nominal charge for that.

18. Think Python - Free Online, Python Tutorial PDF

This one is another beginner’s book, and the author has made the online version as well as the PDF version freely available. You can buy the paper copy from Amazon as well. The author’s main intent in this book is to teach computer science fundamentals, and Python happens to be the programming language used to do so.
A good read for all the students enrolling for computer science disciplines.

19. Learning Django – Beginners, Video

We cannot talk about Python without saying a word about Django. The Django framework has contributed greatly to the recent popularity of the Python programming language.
If you want to learn Django, there is no better video tutorial available than Getting Started with Django. It has multiple videos covering multiple aspects and best practices of the framework.

20. Python Playgrounds - Coding in Python, Online

Tutorials based on interactive coding playgrounds let you try and learn the language without the hassles of setting up your system for development.
You may not grasp the depths of the language but definitely get to understand the basic concepts. Here are the Python playgrounds to make your learning fun -

Cloudera vs Hortonworks vs MapR: Comparing Hadoop Distributions



For all those looking to harness the potential of big data, Hadoop is the platform of choice. This open source software framework enables processing of huge data sets by distributing them across commodity servers. Thus, it eliminates dependency on high-end hardware and makes the entire process economical for businesses to implement. All of the big data enterprises today use Apache Hadoop in some way or the other. To simplify working with Hadoop, enterprise versions like Cloudera, MapR and Hortonworks have sprung up.

In its original version, Hadoop was designed as a simple write-once storage infrastructure. But it has evolved through the years to expand beyond mere web indexing capacity. Based on Google’s MapReduce model, Hadoop is designed to store and process large amounts and variety of data that may reside in multiple computer servers.

While Hadoop’s distributed file system (HDFS) helps break down all incoming data and store them across multiple nodes, the MapReduce component facilitates the simultaneous processing of data across multiple nodes.
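The division of labour described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the split, map, shuffle and reduce steps using a word count, not Hadoop's actual API; all function names here are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Mimic HDFS-style splitting: cut the input into fixed-size blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    # On a real cluster each block would be mapped on a different node
    # in parallel; here the blocks are simply processed in a loop.
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return reduce_phase(shuffle(mapped))
```

Calling `word_count(["big data", "big cluster", "data data"])` splits the three lines into two blocks, maps each block independently, and then merges the partial results into one count per word.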

Hadoop is by no means an out-of-the-box solution. In order to build a truly information-driven enterprise, where decisions are based on data and not guesswork, companies require a data management solution that not only offers robust data governance, but also is easily manageable and seamlessly integrates with existing enterprise infrastructure.

The flexible, modular architecture of Hadoop allows for adding new functionalities for the accomplishment of diverse Big Data tasks. A number of vendors have taken advantage of Hadoop’s open-ended framework and tweaked its code to change or enhance its functionalities. In the process they have been able to fix some of the inherent drawbacks of Apache Hadoop. So far as Hadoop distribution is concerned, the three companies that really stand out in the competition are: Cloudera, MapR and Hortonworks.

Comparing top three Hadoop distributions: Cloudera vs Hortonworks vs MapR

Cloudera has been around the longest; Hortonworks came later. While Hortonworks is 100 percent open source and the core of Cloudera’s distribution is open source, most versions of MapR come with proprietary modules. Each distribution has its unique strengths and weaknesses, and each shares certain overlapping features with the others. If you are looking to make the most of Hadoop’s immense data processing power, it makes sense to compare the top three Hadoop distributions.

Cloudera

Cloudera Inc. was founded by big data geniuses from Facebook, Google, Oracle and Yahoo in 2008. It was the first company to develop and distribute Apache Hadoop-based software and still has the largest user base, with the largest number of clients. Although the core of the distribution is based on Apache Hadoop, it also provides a proprietary Cloudera Management Suite to automate the installation process and provide other services that make life easier for users, such as reducing deployment time and displaying a real-time node count.

Cloudera Overview

Hortonworks

Hortonworks, founded in 2011, has quickly emerged as one of the leading vendors of Hadoop. The distribution provides an open source platform based on Apache Hadoop for analysing, storing and managing big data. Hortonworks is the only commercial vendor to distribute complete open source Apache Hadoop without additional proprietary software. Hortonworks’ distribution HDP2.0 can be directly downloaded from their website free of cost and is easy to install. The engineers of Hortonworks are behind most of Hadoop’s recent innovations, including YARN, which improves on MapReduce in the sense that it enables the inclusion of more data processing frameworks.

Hortonworks Overview

 

MapR

In its standard, open source edition, Apache Hadoop software comes with a number of restrictions. Vendor distributions are aimed at overcoming the issues that users typically encounter in the standard edition. Under the free Apache license, all three distributions provide users with updates to the core Hadoop software. But when it comes to picking one of them, one should look at the additional value it provides to customers in terms of improving the reliability of the system (detecting and fixing bugs, etc.), providing technical assistance and expanding functionality.

All three top Hadoop distributions, Cloudera, MapR and Hortonworks offer consulting, training, and technical assistance. But unlike its two rivals, Hortonworks’ distribution is claimed to be 100 percent open source. Cloudera incorporates an array of proprietary elements in its Enterprise 4.0 version, adding layers of administrative and management capabilities to the core Hadoop software.

Going a step further, MapR replaces the HDFS component and instead uses its own proprietary file system, called MapRFS. MapRFS helps incorporate enterprise-grade features into Hadoop, enabling more efficient management of data, reliability and, most importantly, ease of use. In other words, it is more production ready than its two competitors.

Through a recent partnership with Canonical, the creator of the Ubuntu operating system, MapR is offering Hadoop as a default component of Ubuntu. Under the terms of the partnership, MapR’s M3 Edition for Apache Hadoop will be integrated into the Ubuntu operating system.

Up to its M3 edition, MapR is free, but the free version lacks some of its proprietary features, namely JobTracker HA, NameNode HA, NFS-HA, Mirroring, Snapshot and a few more.

MapR Overview

Cloudera and Hortonworks: The Similarities

Cloudera and Hortonworks are both built upon the same core of Apache Hadoop. As such, they have more similarities than differences.

·         Both offer enterprise-ready Hadoop distributions. The distributions have stood the test of time and of customer use, ensuring security and stability. Besides, both vendors provide paid training and services to familiarize newcomers with Big Data and Analytics.

·         Both have established communities that actively participate and help with problems users face as well as with demonstrations they need.

·         Both distributions have master-slave architecture.

·         Both have a shared-nothing computing framework.

·         Both support MapReduce as well as YARN.

 

Cloudera vs. Hortonworks: The Differences

That being said, the differences are what play a deciding role in choosing one vendor over the other. Broadly, Cloudera and Hortonworks differ in the following aspects:

·         Cloudera has announced that its long-term goal is to become an “enterprise data hub,” thus diminishing the need for a data warehouse. Hortonworks, on the other hand, remains firmly a provider of a Hadoop distro, and has partnered with data warehousing company Teradata.

·         While Cloudera CDH can be run on Windows Server, HDP is available as a native component on Windows Server. A Windows-based Hadoop cluster can be deployed on Windows Azure through the HDInsight Service.

·         Cloudera has proprietary management software (Cloudera Manager), a SQL query handling interface (Impala), and Cloudera Search for easy and real-time access to products. Hortonworks has no proprietary software; it uses Ambari for management, Stinger for handling queries, and Apache Solr for searching data.

·         Cloudera has a commercial license, while Hortonworks has an open source license. Cloudera also allows the use of its open source projects free of cost, but the package doesn’t include the management suite Cloudera Manager or any other proprietary software.

·         Cloudera has a free 60-day trial; Hortonworks is completely free.

Cloudera is the oldest player in the market, with more than 350 customers. But Hortonworks is fast catching up and has made more innovations in the Hadoop ecosystem in the recent past. Cloudera has several enterprise software products overlaid on its open source distribution to aid customers, whereas Hortonworks strives to provide a framework consisting only of open source projects.

Top 10 Reasons Customers Choose MapR Hadoop


Not all Hadoop distributions are created equal. Beyond the marketing claims, there are real differences which impact the bottom line while making IT operations easier. Here are 10 reasons why more customers are choosing MapR.


1. High Availability

MapR takes a holistic approach to high availability. MapR architecture distributes NameNode metadata across all worker nodes in the cluster providing self-healing from multiple failures without requiring additional configuration or hardware. MapR allows instant recovery, with files and tables available rapidly after node failures or cluster restarts. Jobs on MapR do not have to be re-started on node failures and always run to completion. MapR also provides NFS HA for continuous uninterrupted access.


2. World-Record Performance

Performance benefits businesses across multiple dimensions, not only in getting jobs/work done faster, but also squeezing more value from hardware. MapR has stellar performance achievements including the world record for TeraSort, MapR customer holding the world record for MinuteSort, MapR-DB being 4-7x faster than HBase on other distributions, and OpenTSDB on MapR achieving 100 million data points/sec ingestion rates.

3. Ease of Data Integration

MapR Direct Access NFS provides NAS-like access to Hadoop. While other distributions provide poor-performing, read-only NFS, MapR provides complete random read-write capable, POSIX compliant, highly available, high performance NFS access for production use. With NFS, applications can stream writes directly into Hadoop, code in any language works on Hadoop, standard Linux commands and tools such as sed and awk are ready for use, and your existing browsers and development tools work out of the box.

4. Real Multi-tenancy Including YARN

Enterprise data hubs or data lakes require different users and applications to coexist on the same cluster with true job isolation and customized security. MapR is the only platform that is built to provide these capabilities with logical volumes, data placement control, and job placement control for both MapReduce v1 and YARN jobs.

5. Complete Data Protection

MapR ensures the same degree of backup and recovery capability offered by enterprise storage platforms. MapR Snapshots are guaranteed to be consistent, unlike in other distributions, so the snapshot accurately captures the exact state of the cluster at the time the snapshot was taken. Mirroring in MapR allows you to replicate data efficiently across clusters enabling data sharing between production sites, production and research environments or between on-premise and cloud infrastructures.

6. Lowest Total Cost of Ownership

With world-record performance under its belt, MapR is the most economical choice for building a Hadoop cluster. Supported by a No-NameNode architecture, MapR clusters are also homogeneous and therefore easy to maintain and scale unlike other Hadoop distributions which require special-purpose hardware or complex configurations. In addition, MapR provides fine-grained multi-tenancy to maximize system resources and support multiple workloads and distinct user groups efficiently.

7. Enterprise-Grade NoSQL

MapR-DB, the in-Hadoop NoSQL database that uses the HBase API, was recently recognized as the top-ranked NoSQL key-value database for current offering. Its high throughput, consistent low latency, enterprise-grade features, and Hadoop integration allow you to deploy business-critical, real-time operational analytics applications. As MapR customer Atzmon Hen-tov of Pontis puts it, "MapR-DB requires about half the machines compared to other [NoSQL] platforms. This dramatically reduces the cost of a new system."

8. Unbiased Open Source

The MapR Distribution provides customers with more flexibility and choice for their open source projects. MapR supports multiple execution frameworks such as YARN and Spark, and multiple options for SQL-on-Hadoop technologies, machine learning packages, and NoSQL databases. Furthermore, MapR uniquely supports backward compatibility across multiple versions of projects.

9. Read-Write File System

Unlike HDFS, which follows the write-once-read-many paradigm, the MapR Data Platform delivers a true random read-write capable, POSIX compliant file-system providing unique features such as read-write NFS. These capabilities enable Hadoop as a real-time, enterprise storage and processing platform.
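To make the difference concrete, here is a minimal sketch of what random read-write access means for application code: a few bytes in the middle of a file are overwritten in place using ordinary POSIX-style file calls. It uses a local temporary file as a stand-in for a file on a mounted cluster path, which is an assumption made purely so the example runs anywhere; the helper names are hypothetical.

```python
import os
import tempfile


def overwrite_in_place(path, offset, data):
    """Overwrite bytes at `offset` without rewriting the rest of the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def demo():
    # Stand-in for a file on a mounted cluster path (assumed, e.g. an NFS mount).
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"status=PENDING;id=42")
        # Replace the 7-byte value "PENDING" with "DONE" padded to 7 bytes.
        overwrite_in_place(path, 7, b"DONE   ")
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

On a write-once file system this kind of update would require writing a new copy of the file; with random read-write access only the affected bytes change.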

10. Enterprise-grade Security

MapR security includes wire-level encryption for Hadoop, granular authorization with ACLs and boolean access control expressions, and authentication via existing Kerberos infrastructure or a simplified username-password based mechanism. MapR also implements security projects from Apache for additional layers of protection.


You can get this from the following URL:
https://www.mapr.com/top-ten-reasons