I publish research papers relatively sparingly, with each publication often benefiting from experience with real production systems. Here's a summary of some of my research interests over the years:
The killer microsecond: I've observed that while today's computing systems can efficiently handle events at the nanosecond level (through hardware mechanisms) and at the millisecond level (through operating system mechanisms), they are poorly equipped to handle microsecond-level events. That orphaned microsecond time scale, however, is the most critical one for the distributed software systems that run inside a modern datacenter, which suggests that new hardware support for fast I/O is sorely needed. An article on this topic is currently in preparation.
Tail-tolerant systems: in "The Tail at Scale", we describe the challenge of building services that maintain acceptable response times as the system grows in scale, and suggest that new principled techniques are needed to design scalable, low-latency systems. Just as a fault-tolerant system keeps working when components fail, a tail-tolerant system should remain responsive even when some of its subcomponents experience periods of unresponsiveness. See The Tail at Scale.
Energy proportionality: production datacenters tend to exhibit lower energy efficiency in practice than the ratings of the underlying equipment in standardized benchmarks would suggest. The disconnect comes down to activity levels: computing equipment runs more efficiently when it is fully utilized (that is, at peak periods of load) and rather inefficiently at mid-to-low utilization, and robust production Web services typically run at such mid-to-low utilization. Our work has challenged the industry to produce energy-proportional systems, which would run at high energy efficiency regardless of utilization level. See The Case for Energy-Proportional Computing.
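To make the utilization argument concrete, here is a minimal sketch under a simple assumption that is not taken from the paper: power draw rises linearly from some idle fraction of peak power at zero load up to peak power at full load. The function name `relative_efficiency` and the 50% idle figure are hypothetical, chosen only for illustration.

```python
def relative_efficiency(utilization, idle_fraction):
    """Work done per unit of energy, normalized to 1.0 at full utilization.

    Assumes (for illustration only) a linear power model: the machine draws
    idle_fraction of its peak power when idle and its full peak power at
    100% utilization.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be in [0, 1]")
    # Power drawn at this utilization, as a fraction of peak power.
    power = idle_fraction + (1.0 - idle_fraction) * utilization
    # Useful work is proportional to utilization; efficiency is work / power.
    return utilization / power


if __name__ == "__main__":
    for u in (0.1, 0.3, 0.5, 1.0):
        conventional = relative_efficiency(u, idle_fraction=0.5)   # hypothetical server
        proportional = relative_efficiency(u, idle_fraction=0.0)   # ideal proportional system
        print(f"utilization {u:.0%}: conventional {conventional:.2f}, "
              f"energy-proportional {proportional:.2f}")
```

Under this model, a machine that idles at half of its peak power delivers less than half of its peak efficiency at 30% utilization, while an ideal energy-proportional machine stays at full efficiency at every load level.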
Power provisioning: this is the problem of maximizing the number of computers that can be safely hosted in a datacenter facility that is rated to supply a given maximum amount of power to IT equipment. If a facility can host more computers, the hosting cost per machine is lower; but if too many computers are hosted, their combined load could exceed the maximum available power, with possibly drastic consequences. See Power Provisioning for a Warehouse-sized Computer.
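The trade-off can be sketched with a few lines of arithmetic. This is an illustration only, not the method from the paper: the function names, the 1 MW budget, the 500 W rated draw, and the 350 W estimated peak draw are all hypothetical numbers.

```python
def machines_supported(budget_watts, per_machine_watts):
    """Number of machines that fit if each is budgeted per_machine_watts."""
    if budget_watts <= 0 or per_machine_watts <= 0:
        raise ValueError("power values must be positive")
    return budget_watts // per_machine_watts


def exceeds_budget(machine_loads_watts, budget_watts):
    """True if the combined load of all machines is above the facility budget."""
    return sum(machine_loads_watts) > budget_watts


if __name__ == "__main__":
    budget = 1_000_000      # hypothetical facility budget for IT equipment (W)
    rated = 500             # hypothetical worst-case rated draw per machine (W)
    estimated_peak = 350    # hypothetical estimate of realistic peak draw (W)

    conservative = machines_supported(budget, rated)
    aggressive = machines_supported(budget, estimated_peak)
    print(f"budgeting by rated power: {conservative} machines")
    print(f"budgeting by estimated peak: {aggressive} machines")

    # The risk: if every machine reached its rated draw at the same time,
    # the aggressively provisioned facility would go over its budget.
    print("over budget in the worst case:",
          exceeds_budget([rated] * aggressive, budget))
```

Budgeting by the lower estimate hosts more machines in the same facility, which lowers the cost per machine, but it is safe only as long as the machines do not all approach their rated draw together.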
Warehouse-scale computing: we have long held the opinion that the machinery that runs the cloud (large-scale datacenters hosting multiple distributed Web services simultaneously) constitutes a new class of computer. I coined the term "warehouse-scale computers" to highlight the notion that these are not just collections of servers, and that their design and operation can benefit from a principled computer architecture methodology. My colleagues and I have published articles on this topic, which is the central theme of our book, The Datacenter as a Computer.
Storage availability: how often do disks fail? Can we predict or explain those failures? How can we go from knowing the failure statistics of individual storage nodes to reasoning about their impact on the availability of distributed storage services? Two USENIX papers, at FAST and OSDI, explore these issues.
Multi-core processors: while at Digital Equipment and Compaq Corp., our research team investigated which processor features are most useful for commercial workloads and decided to design a chip around them. The system, an 8-core architecture with "wimpy" CPUs, inspired some of the multi-core processors on the market today. See Piranha.