Everything always starts easy. You might only have a few hundred objects in your collection that you need to process. It's easy because you can do everything in one process, one thread, and your results are always returned quickly.
A while later in your life cycle things got more complex and we start do see performance issues. Some architectures have there own ways of dealing with this to a degree.
- Web Farms
- Multi-threading
- Grid computing
However one thing that is more difficult is when a single request needs to work through a lot of data to achieve a simple result. For example, a web farm is a great means of allowing a million users to query small pieces of data. The first problem that arrives is when one user wants a million pieces of data.
Accessing data
Databases are very good at this, and I would be surprised if you or I could produce something general purpose that could beat its returned strategies. They even provide caching capabilities so if a million users all ask for the same million rows, then the database will first take a while to produce that result but then can re-send it to the remaining 999,999 uses.
However the problem comes as we either the data or the query starts to change. Suddenly our percentage of cache misses starts to increase. At this point we actually start to ask ourselves the question that we should have asked a long time ago. Why do I need to gather a million rows?
Client caching
A performance improvement is to not store the million rows in a centralised database and keep querying it, but instead to only reload data that has changed. This leads to update schemes where updates to the the centralised data need to be pushed to the clients, although it is also possible but less desirable for all clients to poll in frequently.
One idea that has come along recently is the concept of shared cache. Products such as Microsoft's Velocity now provide a means for the in memory cache to be shared across many machines. This obviously is targeted to applications where communication between application servers in a farm should be quicker than access to the central database farm. It would also be interesting to see if this works in the client application too, providing peer to peer style updates could certainly be quicker than retrieving back from a centralised server farm, particularly if the clients are at a clients (customers) site (i.e. client -> server, across a WAN; client - client, across a LAN) .
Performing operations
Once you have got your data the next problem occurs with using it. Performing a calculation on 100 numbers is quick, on a million is slower relatively. This is commonly referred to as the Map-Reduce problem, and there are quite a few ways to handle this, although the solutions can vary depending on what you want to do with your calculation.
Simple Map operations
If you need to perform a simple operation on each piece of data, so that for a list of n elements, and each one maps to a single result then you will finish with n results, this is known as mapping. In this case there is no interaction between each element of data and the operations can be performed in parallel. If you run this within a single process, then I would recommend investigating the parallel library, currently expected to become part of core DotNet in v4.0.
The problem with single processes is that they are still limited to a single machine. If you can provide a means of distributing your operation to a set of machines, then you should find that the performance improves, but unfortunately this isn't linearly. Your problem is the overheads, the more you need to start processes, allocate memory or pre-calculate a value, the longer the process will take. I have come to know this as the 'Granularity problem'.
- The smaller the granule of work, then the more overheads will be incurred in processing the entire list.
This effect can be exaggerated depending on the architecture you use to run the processing operations, for example, if you have a pre-configured number of processors that exist in the memory of the machines that will do the processing, then you will remove the overhead of process startup. Then will simply need to gather the process operation package, perform it and return the result. Conversely a more dynamic system that loads up each package and then loads in the processor to complete the operation will have a much greater overhead. As a final alternative, a system that loads a package, then loads up the processor, and when complete checks for other packages that can be handled in its current state sounds a great compromise, unless you have prioritisation schemes that need to be considered as well.
Reduce
If you simply need to add up a list of numbers or perform some operation that reduces a list into a single value, then this is known as a 'reduce'. Here there are some improvements can be made if the operation can be applied iteratively. For example, a list of million numbers will add up to the same total whether you add them up one by one, or if you break them into chunks of 1000, and each of the 1000 values up to a total, then add the 1000 totals to get a grand total.