Coursera公开课-Algorithm-WordNet

问题

Part-Two week1的作业是wordnet。wordnet是一个单词网络。用有向边来表示单词的关系。比如miracle —>event，说明miracle是event的一种，他们是isa关系。

第一个问题是对于给定的一个有向图，找出两者之间的最短距离和公共祖先。

public class SAP {
   // constructor takes a digraph (not necessarily a DAG)
   public SAP(Digraph G)
   // length of shortest ancestral path between v and w; -1 if no such path
   public int length(int v, int w)
   // a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
   public int ancestor(int v, int w)
   // length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
   public int length(Iterable<Integer> v, Iterable<Integer> w)
   // a common ancestor that participates in shortest ancestral path; -1 if no such path
   public int ancestor(Iterable<Integer> v, Iterable<Integer> w)
   // do unit testing of this class
   public static void main(String[] args)
}

第二个问题是对于给定的一个wordnet，找出wordnet中的所有单词，并计算任意两个单词之间的距离。

public class WordNet {
   // constructor takes the name of the two input files
   public WordNet(String synsets, String hypernyms)
   // returns all WordNet nouns
   public Iterable<String> nouns()
   // is the word a WordNet noun?
   public boolean isNoun(String word)
   // distance between nounA and nounB (defined below)
   public int distance(String nounA, String nounB)
   //distance(A, B) = distance is the minimum length of any ancestral path between any synset v of A and any synset w of B.
   // a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB
   // in a shortest ancestral path (defined below)
   public String sap(String nounA, String nounB)
   // do unit testing of this class
   public static void main(String[] args)
}

最后一个问题是找出wordnet当中和其他单词关联性最小的一个单词，换句话说，我们要找出一个单词Ai使得

1	di = dist(Ai, A1) + dist(Ai, A2) + ... + dist(Ai, An)

最大。

public class Outcast {
   public Outcast(WordNet wordnet)         // constructor takes a WordNet object
   public String outcast(String[] nouns)   // given an array of WordNet nouns, return an outcast
   public static void main(String[] args)  // see test client below
}

更具体的题目描述，请看：http://coursera.cs.princeton.edu/algs4/assignments/wordnet.html

输入数据格式
wordnet输入有两个文件，一个是synsets，存储了wordnet当中所有单词，主要有三部分，第一部分是一个整数，是sysnets中的id，第二部分是一组单词，第三部分是这组单词的解释。如

1
2
3

> 36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire 
>
>

表示单词集合{AND_circuit, AND_gate}在sysnet中的id是36，他们有个共同的解释就是a circuit in a computer that fires only when all of its inputs fire 。

wordnet第二个输入文件是hypernyms，存储了synsets之间的关系，如

1 2	> 164,21012,56099 >

表示sysnet 164 (“Actifed”)有两个子类，21012(“antihistamine”)和56099 (“nasal_decongestant”)，这和sysnets文件的输入是对应的

164,Actifed,trade name for a drug containing an antihistamine and a decongestant…
21012,antihistamine,a medicine used to treat allergies…
56099,nasal_decongestant,a decongestant that provides temporary relief of nasal…

思路分析

从哪里入手呢？官方很贴心地给出了步骤(Possible progress steps)：http://coursera.cs.princeton.edu/algs4/checklists/wordnet.html

Create the data type SAP
Implement the remaining WordNet methods.
Implement Outcast.
如果没有思路，可以先看下官方给的checklists，非常有用。

我们看SAP，求最短路径和公共祖先。因为wordnet是一个有向图，并且不会有环（这点题目中已说明），因此求最短路径只需要走一遍bfs即可，同时记录下可达点。有了SAP后，求wordnet中的distance其实就是直接套用SAP的代码。wordnet中最关键的还是数据如何存储。如果是以id来作为key的话，遇到一个id对应多个单词的情况处理起来比较麻烦，因此我们以单词为key，存储单词对应的id(可能有多个)，求wordnet中的distance(String nounA, String nounB)，其实就是求单词对应id的最短路径，直接用SAP来求即可。Outcast应该是好实现的了。直接遍历wordnet当中所有单词两两之间的距离，求一个最小的即可。时间复杂度是O(N^2)。

代码实现

SAP

import edu.princeton.cs.algs4.Digraph;
import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.StdIn;
import edu.princeton.cs.algs4.StdOut;
public class SAP {
    private static final int INFINITY = Integer.MAX_VALUE;
    private Digraph G;
    public SAP(Digraph G) {
        checkCondition(G == null);
        this.G = new Digraph(G);
    }
    // length of shortest ancestral path between v and w; -1 if no such path
    public int length(int v, int w) {
        checkCondition(v < 0 || v >= G.V());
        checkCondition(w < 0 || w >= G.V());
        return getLengthOf(v, w);
    }
    // a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
    public int ancestor(int v, int w) {
        checkCondition(v < 0 || v >= G.V());
        checkCondition(w < 0 || w >= G.V());
        return getAncestorOf(v, w);
    }
    // length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
    public int length(Iterable<Integer> v, Iterable<Integer> w) {
        checkCondition(v == null);
        checkCondition(w == null);
        for (int v1 : v) checkCondition(v1 < 0 || v1 >= G.V());
        for (int w1 : w) checkCondition(w1 < 0 || w1 >= G.V());
        return getLengthOf(v, w);
    }
    // a common ancestor that participates in shortest ancestral path; -1 if no such path
    public int ancestor(Iterable<Integer> v, Iterable<Integer> w) {
        checkCondition(v == null);
        checkCondition(w == null);
        for (int v1 : v) checkCondition(v1 < 0 || v1 >= G.V());
        for (int w1 : w) checkCondition(w1 < 0 || w1 >= G.V());
        return getAncestorOf(v, w);
    }
    private int getAncestorOf(Object v, Object w) {
        int ancestor = -1;
        DeluxeBFS uv = new DeluxeBFS(G, v);
        DeluxeBFS uw = new DeluxeBFS(G, w);
        int result = INFINITY;
        for(int i = 0; i < G.V(); i++) {
            if(uv.hasPathTo(i) && uw.hasPathTo(i)) {
                int distance = uv.distTo(i) + uw.distTo(i);
                if(distance < result) {
                    result = distance;
                    ancestor = i;
                }
            }
        }
        return ancestor;
    }
    private int getLengthOf(Object v, Object w) {
        DeluxeBFS uv = new DeluxeBFS(G, v);
        DeluxeBFS uw = new DeluxeBFS(G, w);
        int result = INFINITY;
        for(int i = 0; i < G.V(); i++) {
            if(uv.hasPathTo(i) && uw.hasPathTo(i)) {
                int distance = uv.distTo(i) + uw.distTo(i);
                if(distance < result) {
                    result = distance;
                }
            }
        }
        return result == INFINITY ? -1 : result;
    }
    private void checkCondition(boolean inValid) {
        if (inValid)
            throw new IllegalArgumentException("param must be not null");
    }
    public static void main(String[] args) {
        In in = new In(args[0]);
        Digraph G = new Digraph(in);
        SAP sap = new SAP(G);
        while (!StdIn.isEmpty()) {
            int v = StdIn.readInt();
            int w = StdIn.readInt();
            int length   = sap.length(v, w);
            int ancestor = sap.ancestor(v, w);
            StdOut.printf("length = %d, ancestor = %d\n", length, ancestor);
        }
    }
}

WordNet

import edu.princeton.cs.algs4.DirectedCycle;
import edu.princeton.cs.algs4.Bag;
import edu.princeton.cs.algs4.Digraph;
import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.ST;
import java.util.ArrayList;
public class WordNet {
    private ST<String, Bag<Integer>> mDataSet = new ST<>();
    private ArrayList<String> mAllNouns = new ArrayList<>();
    private Digraph mDigraph;
    private SAP mSAP;
    // constructor takes the name of the two input files
    public WordNet(String synsets, String hypernyms) {
        validateParam(hypernyms);
        validateParam(synsets);
        In synsetsIn = new In(synsets);
        int V = 0;
        while (synsetsIn.hasNextLine()) {
            String[] data = synsetsIn.readLine().split(",");
            int id = Integer.parseInt(data[0]);
            String[] nouns = data[1].split(" ");
            for (String noun : nouns) {
                if (mDataSet.contains(noun)) {
                    mDataSet.get(noun).add(id);
                } else {
                    Bag<Integer> bag = new Bag<Integer>();
                    bag.add(id);
                    mDataSet.put(noun, bag);
                }
            }
            V++;
            mAllNouns.add(data[1]);
        }
        mDigraph = new Digraph(V);
        boolean[] isNotRoots = new boolean[V];
        In hypernymsIn = new In(hypernyms);
        while (hypernymsIn.hasNextLine()) {
            String[] data = hypernymsIn.readLine().split(",");
            int childId = Integer.parseInt(data[0]);
            isNotRoots[childId] = true;
            for (int i=1; i< data.length; i++) {
                int parentId = Integer.parseInt(data[i]);
                mDigraph.addEdge(childId, parentId);
            }
        }
        if (mSAP == null) {
            mSAP = new SAP(mDigraph);
        }
        int rootCount = 0;
        for (boolean notRoot : isNotRoots)
            if (!notRoot) rootCount++;
        DirectedCycle cycle = new DirectedCycle(mDigraph);
        cycle.hasCycle();
        if (rootCount > 1 || cycle.hasCycle())
            throw new IllegalArgumentException("not a valid WordNet! " +
                    "please check your input!");
    }
    // returns all WordNet nouns
    public Iterable<String> nouns() {
        return mDataSet.keys();
    }
    // is the word a WordNet noun?
    public boolean isNoun(String word) {
        validateParam(word);
        return mDataSet.contains(word);
    }
    // distance between nounA and nounB (defined below)
    public int distance(String nounA, String nounB) {
        validateParam(nounA);
        validateParam(nounB);
        Bag<Integer> bagOfNounA = mDataSet.get(nounA);
        Bag<Integer> bagOfNounB = mDataSet.get(nounB);
        return mSAP.length(bagOfNounA, bagOfNounB);
    }
    // a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB
    // in a shortest ancestral path (defined below)
    public String sap(String nounA, String nounB) {
        validateParam(nounA);
        validateParam(nounB);
        Bag<Integer> bagOfNounA = mDataSet.get(nounA);
        Bag<Integer> bagOfNounB = mDataSet.get(nounB);
        int ancestor = mSAP.ancestor(bagOfNounA, bagOfNounB);
        return mAllNouns.get(ancestor);
    }
    private void validateParam(Object param) {
        if (param == null)
            throw new IllegalArgumentException("param must be not null");
    }
    // do unit testing of this class
    public static void main(String[] args) {
    }
}

Outcast

import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.StdOut;
public class Outcast {
    private WordNet mWordNet;
    // constructor takes a WordNet object
    public Outcast(WordNet wordnet) {
        validateParam(wordnet);
        mWordNet = wordnet;
    }
    // given an array of WordNet nouns, return an outcast
    public String outcast(String[] nouns) {
        validateParam(nouns);
        int[] distance = new int[nouns.length];
        for (int i=0; i<nouns.length-1; i++){
            for (int j=i+1; j<nouns.length; j++){
                int dist = mWordNet.distance(nouns[i], nouns[j]);
                if (i != j) distance[j] += dist;
                            distance[i] += dist;
            }
        }
        int maxDistance = 0;
        int maxIndex = 0;
        for (int i=0; i < distance.length; i++){
            if (distance[i] > maxDistance){
                maxDistance = distance[i];
                maxIndex = i;
            }
        }
        return nouns[maxIndex];
    }
    private void validateParam(Object param) {
        if (param == null)
            throw new IllegalArgumentException("param must be not null");
    }
    // see test client below
    public static void main(String[] args) {
        WordNet wordnet = new WordNet(args[0], args[1]);
        Outcast outcast = new Outcast(wordnet);
        for (int t = 2; t < args.length; t++) {
            In in = new In(args[t]);
            String[] nouns = in.readAllStrings();
            StdOut.println(args[t] + ": " + outcast.outcast(nouns));
        }
    }
}

总结

不得不说官方的API设计得非常好，甚至连test case都帮你搞定，让学生可以专注于算法的设计而不用去考虑其他琐碎的事情，你需要做的仅仅是想如何数据结构，使用什么算法去实现。checklists也很详细，做题目之前把checklists看一遍可以省掉你不少时间。更令人称赞的是他们设计得这套评分系统，不仅仅考察输出结果的正确性，包括对于不合法的输入检查，时间复杂度和内存的使用都在考察范围之内，要想一次拿100分还真不容易。