Mining Source Code Repositories with Solr

Exploring Full-Text Search

www.garysieling.com/jug

Gary Sieling / @garysieling

Overview

  • Demo
  • How-to
  • Inspiration

Motivation

Which engineer worked on a particular client/project/technology?

Project Structure

Search for Experts In...

Corporate Open Source...

Motivation

Who is an expert in technology X?

Solr - Schema


<field name="id" type="string" indexed="true" stored="false" required="true" /> 
<field name="author" type="git_author" indexed="true" stored="true" required="true" /> 
<field name="company" type="string" indexed="true" stored="true" required="true" /> 
<field name="year" type="string" indexed="true" stored="true" required="true" /> 
<field name="email" type="string" indexed="true" stored="false" required="true" /> 
<field name="message" type="string" indexed="true" stored="false" required="true" /> 
<field name="search" type="text_general" indexed="true" stored="false" required="false" /> 
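
For context, the search field above uses the text_general type, which is typically defined with an analyzer chain. This is a sketch based on Solr's stock schema.xml; the exact filters vary by Solr version:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```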
    

Stats for Live Demo

  • 2.1 GB Git History => 132 MB Solr Index
  • Index of 232,839 commits
  • Drupal, Git, Lucene, Lucene.NET, Solr, Mono, Node.JS, PHP, Postgres, Vagrant

Wingspan Source Code

  • 2,000 MB Git Archive => 90 MB Solr Index
  • Works well for finding project leads
  • Works well for finding technology experts

Github Archive

  • 202 GB Git Archive => 1 GB Solr Index
  • 18,000 repositories
  • 4,554,502 commits
  • 2-3 hour conversion process
  • Developer workstation (i7 / 16 GB RAM)

What is Full Text Search?

  • The dogs ran home => dog run home
  • The dog runs home => dog run home
  • Full Text Search vs. Google
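
The normalization above (stopword removal plus stemming) can be illustrated with a toy analyzer. This is a deliberately naive sketch; Solr's real tokenizers and stem filters are far more sophisticated:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ToyAnalyzer {
    // Naive demo: lowercase, drop stopwords, strip a plural "s".
    // Real analyzers (e.g. Solr's PorterStemFilter) do much more.
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an");
    private static final Map<String, String> IRREGULAR = Map.of("ran", "run", "runs", "run");

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (STOPWORDS.contains(word)) continue;
            String stem = IRREGULAR.getOrDefault(word,
                word.endsWith("s") ? word.substring(0, word.length() - 1) : word);
            tokens.add(stem);
        }
        return tokens;
    }
}
```

Both example sentences reduce to the same token stream, which is why they match the same queries.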

    What is Solr?

    How this is built

  • Front-end
  • Back-end
  • JSON API

    /select/?q=search:sql
      &version=2.2
      &start=0&rows=0
      &indent=on
      &facet=on
      &facet.field=author
      &facet.method=fc
      &facet.limit=30
      &wt=json
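
Assembled in Java, the same request could be built like this (a sketch using only the standard library; the /select/ path is relative to a placeholder host and core):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FacetQuery {
    // Builds the facet request shown above as a URL query string.
    public static String build(String field, String term) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "search:" + term);
        params.put("start", "0");
        params.put("rows", "0");          // facet counts only, no documents
        params.put("facet", "on");
        params.put("facet.field", field);
        params.put("facet.method", "fc");
        params.put("facet.limit", "30");
        params.put("wt", "json");

        StringBuilder url = new StringBuilder("/select/?");
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (url.charAt(url.length() - 1) != '?') url.append('&');
            url.append(e.getKey()).append('=')
               .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return url.toString();
    }
}
```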

    Building a facet UI

    
    var url = 'http://solr/core1/select/?q=search:(jug)&wt=json';
    $.ajax({
      type: "GET",
      url: url,
      success: function(response){
        // display search results
      }
    });
    

    Backend ETL

    • Extracting Git Commits
    • Transforming Commits
    • Loading data to Solr

    Why do you need Java code?

    • Extract data (e.g. PDF contents)
    • Data transformations (email -> company)

    Extracting Data from Git

    
    FileRepositoryBuilder builder = new FileRepositoryBuilder();
    Repository repository = 
      builder.setGitDir(new File(path))
        .build();
    
    RevWalk walk = new RevWalk(repository);
    
    for (Ref ref : repository.getAllRefs().values()) {
      if ("HEAD".equals(ref.getName())) {
        walk.markStart(walk.parseCommit(ref.getObjectId()));
        break;
      }
    }
    
    for (RevCommit commit : walk) {
      ...
    }
        

    Extracting Data - Patches

    
    // To fetch file diffs, we must provide a stream, where they will be written:
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DiffFormatter df = new DiffFormatter(out);
    df.setRepository(repository);
    
    // Compare against the first parent commit
    RevCommit parent = walk.parseCommit(commit.getParent(0));
    List<DiffEntry> diffs = df.scan(parent.getTree(), commit.getTree());
    
    // Each of these entries is one file in the commit
    for (DiffEntry diff : diffs) {
      FileHeader fh = df.toFileHeader(diff);
      df.format(diff);
      String diffText = out.toString("UTF-8");
      String diffText = out.toString("UTF-8");
    
      // Reset the stream, so we can get each patch separately
      out.reset();
    
      ...
    }
    

    Extracting Commit Metadata

    
    // For this application, only find new or modified files
    if (diff.getChangeType() == DiffEntry.ChangeType.MODIFY ||
        diff.getChangeType() == DiffEntry.ChangeType.ADD) {
    
      // We have enough information now to get commit messages, 
      // file names, and author information
      System.out.println(diff.getNewPath());
      System.out.println(commit.getFullMessage());
      System.out.println(commit.getAuthorIdent().getName());
      System.out.println(commit.getAuthorIdent().getEmailAddress());
    }
                

    Transformations - Search Data

    
    // myAbstractFactory => my Abstract Factory
    Pattern capitals = Pattern.compile("([a-z])([A-Z])");
    Matcher m = capitals.matcher(file);
    
    String fileNameTokens = 
      m.replaceAll("$1 $2");
    
    // /project/src/large_grid.js => project src large grid js
    fileNameTokens = fileNameTokens
       .replace("/", " ")
       .replace("-", " ")
       .replace(".", " ")
       .replace("_", " ");
      
    String search =
       commit.getFullMessage() + 
       " " + file;
        

    Transformations - Companies

    
    // x.y@google.com => google.com
    String company = emailAddress.split("@")[1];
    
    if (company.contains("."))
    {
      // google.com => google
      company = company.substring(0, company.lastIndexOf("."));
    }
    
    return company;
        

    Transformations - Commit Footers

    • Signed-off-by
    • Acked-by
    • Reported-by
    • Tested-by
    • CC, Cc
    • Bug
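
Footers like these can be pulled out with a regular expression. A minimal sketch: it scans every "Key: value" line in the message, whereas a strict parser (like git interpret-trailers) would only consider the final paragraph:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FooterParser {
    // Matches Git commit footers ("trailers") such as
    //   Signed-off-by: Jane Doe <jane@example.com>
    private static final Pattern FOOTER =
        Pattern.compile("^([A-Za-z-]+):\\s*(.+)$", Pattern.MULTILINE);

    public static Map<String, List<String>> parse(String message) {
        Map<String, List<String>> footers = new LinkedHashMap<>();
        Matcher m = FOOTER.matcher(message);
        while (m.find()) {
            footers.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
        }
        return footers;
    }
}
```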

    Loading Data in Solr

    
    // Connections happen over HTTP:
    HttpSolrServer server = new HttpSolrServer(
         "http://localhost:8080/solr");
    
    Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    
    SolrInputDocument doc = new SolrInputDocument();
    
    // ID for the document contains enough information to find it in the source data:
    doc.addField("id", remoteUrl + "." + commit.getId());
    
    doc.addField("author", commit.getAuthorIdent().getName());
    doc.addField("email", commit.getAuthorIdent().getEmailAddress());
    doc.addField("message", commit.getFullMessage());
    
    // Any data we let the user search against is included in this value:
    doc.addField("search", search);
    
    // Queue the document and send the batch to Solr:
    docs.add(doc);
    server.add(docs);
    server.commit();
    

    Loading Data - Threading

    
    File[] files = new File("repositories\\").listFiles();
    String lastRepository = getLastRepository(); // resume after failure
    
    int i = 0;
    int numThreads = Runtime.getRuntime().availableProcessors();
            
    for (File f : files) {
      if (i % numThreads == _myIndex) { // load every nth repository
        String filename = f.getAbsolutePath() + "\\.git";
                
        convertRepo(server, filename);
      }
      i++;
    }
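
An alternative to hand-partitioning by index is to hand the repositories to a fixed thread pool. A sketch, with a stubbed-in convertRepo standing in for the real conversion step:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RepoLoader {
    // Distribute repositories across one thread per core; the pool
    // handles scheduling instead of the modulo arithmetic above.
    public static List<String> loadAll(List<String> repoPaths) {
        int numThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<String> converted = Collections.synchronizedList(new ArrayList<>());
        for (String path : repoPaths) {
            pool.submit(() -> converted.add(convertRepo(path)));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return converted;
    }

    // Stand-in for the real Git-to-Solr conversion.
    private static String convertRepo(String path) { return path; }
}
```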
        

    Setting up Solr

    • Download Solr
    • Modify schema.xml
    • Start Jetty:
      java -jar start.jar

    Solr - Limitations

    • Don't expose Solr publicly
    • Don't rely on joins
    • Roll your own ACLs, if needed

    Lessons Learned

    • Parsing Code Comments
    • Testing
    • High Volume Indexing
    • Inspiration for similar tools

    Parsing Comments

            
    @@ -1,9 +1,9 @@
    /* 
    * Here is a multi-line comment.
    */
    String a = "x"; // here is a comment at the end of a line
    
    -String b /* a comment in a line */ = "y"; 
    +String b /* a long comment in a line */ = "y"; 
    
    // Commented out code:
    // String c = "d";       
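
A minimal sketch of stripping such comments before indexing. It handles // and /* */ markers and leaves them alone inside string literals, but ignores char literals and escaped quotes; real code needs a proper lexer:

```java
public class CommentStripper {
    public static String strip(String src) {
        StringBuilder out = new StringBuilder();
        boolean inString = false, inLine = false, inBlock = false;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            if (inLine) {                          // inside a // comment
                if (c == '\n') { inLine = false; out.append(c); }
            } else if (inBlock) {                  // inside a /* */ comment
                if (c == '*' && next == '/') { inBlock = false; i++; }
            } else if (inString) {                 // inside a string literal
                out.append(c);
                if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
                out.append(c);
            } else if (c == '/' && next == '/') {
                inLine = true;
            } else if (c == '/' && next == '*') {
                inBlock = true;
                i++;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```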
    

    JSX

    
    var MarkdownEditor = React.createClass({
      render: function() {
        return (
          <div className="MarkdownEditor">
            <h3>Input</h3>
            <div
              className="content"
              dangerouslySetInnerHTML={{
                __html: converter.makeHtml(this.state.value)
              }}
            />
          </div>
        );
      }
    });
    

    Testing ETL processes

    Indexing Many Repositories

    Constraints on ETL processes

    • Real-time updates are hard
    • Plan for power failures
    • Plan for reloading data
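
One way to plan for power failures is a checkpoint file. A hypothetical implementation of the getLastRepository() idea from the threading slide; the file location is an assumption:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class Checkpoint {
    // Persist the last successfully loaded repository so a crashed
    // run can skip ahead to where it left off.
    private final Path file;

    public Checkpoint(Path file) { this.file = file; }

    public void save(String repo) {
        try {
            Files.writeString(file, repo);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public String load() {
        try {
            return Files.exists(file) ? Files.readString(file) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Write the checkpoint only after a repository has been fully committed to Solr, so a resume never skips partially loaded data.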

    Variations

    • Lots of free data, especially Government - PACER extracts, U.S. Code, Philly Code
    • MS Exchange / Sharepoint
    • D3.js, Tilemill

    References

    https://github.com/garysieling/git-solr-talk

    https://github.com/garysieling/solr-git

    THE END

    BY Gary Sieling / @garysieling

    gsieling@wingspan.com