
Computing the mode (most frequent element) of a set in linear time?

Updated: 2023-02-06 18:02:09


In the book "The Algorithm Design Manual" by Skiena, computing the mode (most frequent element) of a set is said to have a Ω(n log n) lower bound (this puzzles me), and also (correctly, I guess) that no faster worst-case algorithm exists for computing the mode. I'm only puzzled by the lower bound being Ω(n log n).

See the page of the book on Google Books

But surely this could in some cases be computed in linear time (best case), e.g. by Java code like below (finds the most frequent character in a string), the "trick" being to count occurrences using a hashtable. This seems obvious.

So, what am I missing in my understanding of the problem?

EDIT: (Mystery solved) As StriplingWarrior points out, the lower bound holds if only comparisons are used, i.e. no indexing of memory, see also: http://en.wikipedia.org/wiki/Element_distinctness_problem

// Linear time
char computeMode(String input) {
  // initialize currentMode to first char
  char[] chars = input.toCharArray();
  char currentMode = chars[0];
  int currentModeCount = 0;
  HashMap<Character, Integer> counts = new HashMap<Character, Integer>();
  for(char character : chars) {
    int count = putget(counts, character); // occurences so far
    // test whether character should be the new currentMode
    if(count > currentModeCount) {
      currentMode = character;
      currentModeCount = count; // also save the count
    }
  }
  return currentMode;
}

// Constant time
int putget(HashMap<Character, Integer> map, char character) {
  if(!map.containsKey(character)) {
    // if character not seen before, initialize to zero
    map.put(character, 0);
  }
  // increment
  int newValue = map.get(character) + 1;
  map.put(character, newValue);
  return newValue;
}
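As an aside, the same counting idea can be written more compactly with `HashMap.merge` (Java 8+). This is just a sketch equivalent to the code above, not part of the original question; the class name `ModeExample` is made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

class ModeExample {
    // Same linear-time counting idea as above, using HashMap.merge (Java 8+)
    static char computeMode(String input) {
        Map<Character, Integer> counts = new HashMap<>();
        char mode = input.charAt(0);
        int modeCount = 0;
        for (char c : input.toCharArray()) {
            // merge inserts 1 on first sight, otherwise adds 1 to the stored count
            int count = counts.merge(c, 1, Integer::sum);
            if (count > modeCount) {
                mode = c;
                modeCount = count;
            }
        }
        return mode;
    }
}
```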

The author seems to be basing his logic on the assumption that comparison is the only operation available to you. Using a Hash-based data structure sort of gets around this by reducing the likelihood of needing to do comparisons in most cases to the point where you can basically do this in constant time.

However, if the numbers were hand-picked to always produce hash collisions, you would end up effectively turning your hash set into a list, which would make your algorithm O(n²). As the author points out, simply sorting the values into a list first provides the best guaranteed algorithm, even though in most cases a hash set would be preferable.
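The sorting-based approach the author alludes to can be sketched as follows (a hedged illustration on characters, with the class name `SortedMode` made up for this example): after sorting, equal elements are adjacent, so a single scan for the longest run finds the mode in guaranteed O(n log n) time, with no dependence on hash behavior.

```java
import java.util.Arrays;

class SortedMode {
    // Guaranteed O(n log n): sort, then scan for the longest run of equal elements
    static char computeMode(String input) {
        char[] chars = input.toCharArray();
        Arrays.sort(chars); // equal characters become adjacent
        char mode = chars[0];
        int bestRun = 0;
        int run = 0;
        for (int i = 0; i < chars.length; i++) {
            // extend the current run, or start a new one at a boundary
            run = (i > 0 && chars[i] == chars[i - 1]) ? run + 1 : 1;
            if (run > bestRun) {
                bestRun = run;
                mode = chars[i];
            }
        }
        return mode;
    }
}
```

The sort dominates the cost; the scan itself is linear and uses only comparisons of adjacent elements, which is why this bound holds in the comparison-only model discussed above.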