Posts tagged Distributed

Cloud Poll: Can Microsoft’s Distributed Analytics Tools Compete with Hadoop?

This week Microsoft Research released Project Daytona MapReduce Runtime, a developer preview of a new product designed for working with large distributed data sets. Microsoft also has a big data analytics platform called LINQ to HPC, which uses LINQ instead of MapReduce. Notably, LINQ to HPC is used in production at Microsoft Bing.

But Microsoft is entering an increasingly crowded market. There’s the open source Apache Hadoop, which is now being sold in different flavors by companies such as Cloudera, DataStax, EMC, IBM and soon a spin-off of Yahoo. Not to mention HPCC, which will be open-sourced by LexisNexis.

Microsoft’s products are currently in early, experimental stages and the company may never step up the development and marketing of these to be serious Hadoop and HPCC competitors. But could Microsoft be competitive here if it wants to?

View full post on ReadWriteWeb

Red Hat Announces NoSQL Inspired Distributed Data Cache

Red Hat today announced JBoss Enterprise Data Grid 6, which it calls “a cloud-ready, highly scalable distributed data cache.” Cameron Purdy defines a data grid as “a system composed of multiple servers that work together to manage information and related operations – such as computations – in a distributed environment.”

Like Apache Cassandra and Riak, Red Hat’s data grid is influenced by Amazon’s distributed data store Dynamo. The product will cache data in memory and distribute it across multiple servers, which will be useful for cloud computing.

The new product is based on the JBoss community project Infinispan.

According to the announcement, “JBoss Enterprise Data Grid is part of Red Hat’s vision to redefine middleware and provide a comprehensive, open source distributed service fabric to help developers and organizations build, deploy and manage applications in the cloud.”

Those interested can sign up for early access to Data Grid here.

JBoss Enterprise Data Grid will compete with data grids such as Oracle Coherence.

3 Ways to Virtualize Applications with Distributed Computing

The explosion of data driven by sensors, data mining, social media and other Web-based interactions means that more and more companies will need to find ways of dealing with massive data sets – even companies that haven’t typically been data driven before. New business analytics applications may require more processing power than your organization has ever needed, so you will have to handle data as efficiently as possible. Infrastructure-as-a-service providers and inexpensive data warehousing appliances with in-memory analytics will provide options for many organizations. But some may find distributed computing a better fit for their organization’s big data needs.

Scientists and academics have been taking advantage of distributed computing for years, but it’s an approach that can benefit information workers in other areas. Here are some methods of running applications in distributed environments, including some newer approaches.

Beowulf Clustering

Made popular by NASA in 1993, cluster computing uses commodity hardware to pool resources for virtual applications. Pooling multiple systems allows you to take advantage of parallel processing, which puts multiple processors to work on a problem simultaneously. For work that can be split into independent pieces, several slower processors working in parallel are often more cost-effective than a single fast processor working alone. Supercomputers use multiple processors for parallel processing, but a cluster of low-end machines working in unison can become the equivalent of a large supercomputer.

Beowulf clustering is a popular architecture for cluster computing. The open source Parallel Virtual Machine (PVM) software package and Message Passing Interface (MPI) implementations, such as OpenMPI or MPICH, are common software for building these clusters.
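
To make the idea of parallel processing concrete, here is a minimal sketch of the split, compute and combine pattern that cluster jobs follow. It is only an illustration: the function names are made up for this example, and threads on a single machine stand in for cluster nodes. A real Beowulf cluster would pass the chunks between machines with MPI or PVM, and Python threads will not actually speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks take one extra item each.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The per-node work: each worker only sees its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Scatter chunks to workers, then gather and combine the partial sums."""
    chunks = split_into_chunks(items, n_workers)
    # Threads are a stand-in for cluster nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


print(parallel_sum_of_squares(list(range(10)), n_workers=3))  # 285
```

The pattern only pays off when the chunks can be processed independently, since the single combine step at the end is the only point where workers share results.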

The advantage of using this approach is that you can run an application on a large number of inexpensive pieces of hardware, including systems with completely different hardware. However, some of the newer methods we’ll discuss next may be preferable.

Server Aggregation

Although a Beowulf cluster attempts to mimic the behavior of a single machine with multiple processors, each part of the cluster still has its own operating system and software stack installed. Instead of having a virtual application that runs across several machines with each running an operating system, the server aggregation approach runs a single instance of an OS across all the servers in a cluster. Therefore, all the physical resources go to the virtual machine.

This can be a real cost saver. The cost of servers is non-linear, so it’s generally cheaper to buy several two-socket systems than a single multi-socket system. And since you’ll have multiple servers, they can be used for other purposes when you don’t need them for massive number crunching jobs.

Server aggregation is a relatively new approach to virtualization. As far as we know, the only company to offer server aggregation solutions is ScaleMP. However, we expect to see this approach take off over the next few years.

Server aggregation makes sense if you want to use your own hardware. But what if you want to use a public infrastructure-as-a-service?

Virtual Cluster Appliances

A virtual appliance is typically a VM designed to do a specific function with minimal configuration. Virtual cluster appliances are VMs designed for cluster computing right out of the box. You can learn more about the approach here.

One of the advantages here is that these VMs can be deployed to a cloud service like Amazon EC2. Instead of hosting several physical servers running virtualization or aggregation software, you can have many virtual servers running in parallel in the cloud. You still get the advantage of massively parallel computing, but without the hassle of running physical infrastructure.

The Nimbus Project is an open source toolkit for creating virtual infrastructure for cluster computing. It was used by the STAR project to build a 100-VM cluster on EC2. Another source for virtual cluster appliances is Grid Appliance, which offers both a general purpose cluster appliance and one built specifically for Apache Hadoop.

Photo by hutch

Burned by Twitter’s API Restrictions, Developers Launch Distributed Microblogging Service

ElephantDB, a Distributed Database for Working with Hadoop

We first told you about ElephantDB earlier this year in our article Secrets of BackType’s Data Engineers. But we didn’t link to the GitHub repo, which has been making the rounds in the blogosphere for the past couple of days.

As a refresher, ElephantDB is a distributed database created by BackType to export data from Hadoop and serve it to analytics applications, APIs, etc.

A bit more detail from the ReadMe:

ElephantDB is a database that specializes in exporting key/value data from Hadoop. ElephantDB is composed of two components. The first is a library that is used in MapReduce jobs for creating an indexed key/value dataset that is stored on a distributed filesystem. The second component is a daemon that can download a subset of a dataset and serve it in a read-only, random-access fashion. A group of machines working together to serve a full dataset is called a ring.

Since ElephantDB server doesn’t support random writes, it is almost laughingly simple. Once the server loads up its subset of the data, it does very little. This leads to ElephantDB being rock-solid in production, since there’s almost no moving parts.

ElephantDB server has a Thrift interface, so any language can make reads from it. The database itself is implemented in Clojure.

An ElephantDB datastore contains a fixed number of shards of a “Local Persistence”. ElephantDB’s local persistence engine is pluggable, and ElephantDB comes bundled with a local persistence implementation for Berkeley DB Java Edition. On the MapReduce side, each reducer creates or updates a single shard into the DFS, and on the server side, each server serves a subset of the shards.
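
The ReadMe’s description of shards and servers can be sketched roughly as follows. This is not ElephantDB’s code or API: the function names, the MD5-based key hashing and the round-robin shard assignment are all assumptions made for illustration. It only shows how a fixed number of shards lets any client work out which server in a ring holds a given key.

```python
import hashlib


def shard_for_key(key, num_shards):
    """Map a key to a shard index with a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign_shards(num_shards, servers):
    """Spread shards round-robin so each server serves a subset of them."""
    assignment = {server: [] for server in servers}
    for shard in range(num_shards):
        assignment[servers[shard % len(servers)]].append(shard)
    return assignment


def server_for_key(key, num_shards, servers):
    """Find which server in the ring would answer a read for this key."""
    shard = shard_for_key(key, num_shards)
    # Must mirror the round-robin rule used in assign_shards.
    return servers[shard % len(servers)]


ring = ["server-a", "server-b", "server-c"]  # hypothetical host names
print(assign_shards(8, ring))
print(server_for_key("user:42", 8, ring))
```

Because the shard count is fixed when the dataset is built, the key-to-shard mapping never changes between the MapReduce job that writes a shard and the server that later reads it.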

Also of note is Cascalog, a Clojure-based query language for working with Hadoop.

Are Distributed Teams Less Effective?

Are colleagues who collaborate remotely via the Internet at a disadvantage compared to their bricks-and-mortar office-bound counterparts? One recent study from Harvard University suggests so, according to a blog post by Wired columnist Clive Thompson.

The study, titled “Does Collocation Inform the Impact of Collaboration?” examined this very question using a sample of 35,000 biomedical research papers. The team of three scientists calculated the physical distance between the authors of each paper and looked at how many citations the papers received, as a way of measuring how influential they were.

The study found that researchers who lived closer together produced more impactful research. In fact, the closer the authors were, the better. “Teams located in the same building did better than teams that were merely in the same city, and teams that were in the same city did better than those that were inter-city,” writes Thompson.

This is a potentially important question for businesses of all sizes, as the workforce becomes increasingly mobile, thanks to smart phones and cloud-based Web apps.

Of course, it’s hard to know if these results apply to other forms of work, beyond biomedical research. The advantages and disadvantages of working on a distributed team may well depend on the type of work being performed, with some tasks and projects being better suited to remote work than others.

What do you think? Is it better to work remotely or in the same physical space? Let us know your thoughts in the comments.

Photo courtesy of Flickr user brad montgomery

Hacker Poll: Do You Use Distributed Version Control?

RedMonk analyst Stephen O’Grady recently looked at projects hosted on Ohloh and found that only 13.8% of those projects are using distributed version control like Git and Mercurial. All other projects were using centralized systems like CVS and SVN. O’Grady goes on to cite many benefits of using distributed systems, and the reasons that most developers don’t take advantage of them.

I’m curious: do you use distributed version control for any of your projects? If not, why not?

Chart: O’Grady’s breakdown of version control systems on Ohloh, by repository type share

CeBIT 2010 News Distributed By Business Wire Available At WWW.TRADESHOWNEWS.COM

HANNOVER, Germany -- Business Wire will be posting exhibitor news releases issued via Business Wire on a dedicated show news archive for this event. Media may contact the Business Wire Event Services Group at tradeshow@businesswire.com with any questions pertaining to these news releases.

Read more on Business Wire via Yahoo! Finance