This is Arvind from Recruitment and Resources at SancroSoft USA Inc.
We have an urgent requirement as follows:
Please respond with resumes in MS-Word Format with the following details to firstname.lastname@example.org
Full Name :
Contact Number :
Visa Status :
Senior Data Analyst
Location : Reston, VA
Duration: 6-8 weeks (till end of Nov 13)
Responsible for data analysis, validation, cleansing, collection, and reporting. Extract and analyze data from various sources, including databases, manual files, and external websites. Respond to data inquiries from various groups within the organization. Create and publish regularly scheduled and/or ad hoc reports as needed. Document reporting requirements, and process and validate data components as required. Experience with relational databases and knowledge of query tools and/or statistical software is required. Strong analytical and organizational skills are required. Must possess expert-level knowledge of MS Excel. 6+ years of prior experience as a Data Analyst is required.
Skills:
• More than 10 years of deep involvement with SQL and data warehouses
• 2+ years of experience with HIVE
• Working familiarity with Hadoop clusters

Goal:
The data warehouse is the most comprehensive set of domain-oriented transactions covering most of Verisign's operated TLDs. It consolidates data from both the Core and Namestore systems to provide a rich set of data over all Verisign-operated TLDs (dotNAME will be added in 2013Q3). The data within the warehouse is currently utilized by a multitude of external-facing services such as IPS, Data Analyzer, and WhoWas, and by internal-facing services and efforts such as Strategy and domain-renewal forecasting. The goal of this effort is to provide easy access, in the Compute Cluster environment, to current and relevant data fields currently stored within the Data Warehouse. This effort will help existing products utilize the additional data to extend or enhance their products and services, and eliminate the need for additional one-off data-transfer scripts.

Work:
We have determined that we will load into the Compute Cluster environment a one-time dump of a big table (~1B rows, 30+ fields) that represents all the domain-related transactions that ever occurred; on a daily basis, a much smaller table (100k-200k rows) of the day's transactions will be ingested from the Data Warehouse into the Compute Cluster. The work will cover the following three areas:

1. Propose and implement, in coordination with the Data Warehouse team, the method for generating the one-time dump and the daily increments and ingesting them into the Compute Cluster. Propose how the data should be stored in the Compute Cluster (e.g., use partitions, append updates into a large file, etc.).

2. Implement HIVE queries for the following user queries (to be executed in the Compute Cluster):
• What domains are not currently registered but have been registered in the past?
• Which domains were registered between time X and Y? And passed the AGP?
• Which domains expired between time X and Y?
• Which domains were active in a particular TLD at time X?
• What set of domains were previously registered?
• For a given domain, when was it last registered?
• How many different registrars has a given domain been registered by? Which ones?
• What are the temporal registration patterns of these domains (given a list)?
• How many times has this domain been renewed? Transferred? Deleted?
• How many total days has this domain been active?
• What were the domains in the zone for a specific date? Perhaps pre-calculate those for every date.

Additional user queries might be provided for implementation as HIVE queries. If possible, care should be taken to use only UDFs that are also available in IMPALA, as IMPALA might be used for faster query processing of the Data Warehouse data in the Compute Cluster. One known complication is the handling of time as a parameter for these queries, given HIVE's constraints in handling time. One option is to convert time in the Data Warehouse data into UNIX_EPOCH time, so that the HIVE queries can convert user-supplied dates into UNIX_EPOCH time for execution against the source data. Alternative recommendations will be appreciated.

3. Propose and implement the optimal ways for users to interact with the Data Warehouse data in the Compute Cluster, e.g.:
• pre-compute data sets that will be heavily reused
• create HIVE views for the queries in (2) that users can re-use with specific parameters
• use additional partitions for the source or pre-computed data
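The UNIX_EPOCH approach mentioned above can be sketched outside HIVE. Below is a minimal Python illustration of converting a user-supplied date into epoch seconds — the same conversion a HIVE query would perform (e.g., via unix_timestamp()) before comparing against epoch-encoded warehouse data. The function name and date format are illustrative assumptions, not part of the actual design.

```python
from datetime import datetime, timezone

def to_epoch(date_str: str) -> int:
    """Convert a user-supplied 'YYYY-MM-DD' date into UNIX epoch seconds (UTC).

    This mirrors the proposed scheme: warehouse timestamps stored as epoch
    integers, and user-inputted dates converted to epoch at query time.
    """
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

# Bounds for a "registered between time X and Y" query:
x = to_epoch("2013-01-01")  # 1356998400
y = to_epoch("2013-06-30")
assert x < y
```

Storing epoch integers sidesteps HIVE's historically limited date handling, at the cost of readability of the raw data.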
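To make the query logic concrete, here is a sketch of two of the user queries using Python's sqlite3 as a stand-in for HIVE. The table and column names (transactions, domain, op, ts) are illustrative assumptions, not the actual warehouse schema; the eventual HiveQL would use the same MAX/BETWEEN structure over epoch-encoded timestamps.

```python
import sqlite3

# Toy stand-in for the warehouse transactions table (sqlite3 here, HIVE in
# production). Schema and operation names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (domain TEXT, op TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("example.com", "register", 1_200_000_000),
        ("example.com", "delete",   1_250_000_000),
        ("example.com", "register", 1_300_000_000),
        ("example.net", "register", 1_310_000_000),
    ],
)

# "For a given domain, when was it last registered?"
(last_reg,) = conn.execute(
    "SELECT MAX(ts) FROM transactions WHERE domain = ? AND op = 'register'",
    ("example.com",),
).fetchone()
print(last_reg)  # 1300000000

# "Which domains were registered between time X and Y?"
rows = conn.execute(
    "SELECT DISTINCT domain FROM transactions "
    "WHERE op = 'register' AND ts BETWEEN ? AND ?",
    (1_290_000_000, 1_320_000_000),
).fetchall()
print(sorted(d for (d,) in rows))  # ['example.com', 'example.net']
```

Queries of this shape are natural candidates for the HIVE views proposed in (3), with the domain and time bounds as the re-usable parameters.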
Thanks & Regards,
Arvind
SancroSoft USA INC
4944 Sunrise Blvd, Suite B-4 || Fair Oaks, CA 95628
The power of focus across all IT and Engineering disciplines, nationwide
The information contained in this email message is intended only for the personal and confidential use of the recipient(s) named above. The message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient or an agent responsible for delivering it to the intended recipient, you are hereby notified that you have received this document in error and that any review, dissemination, distribution, copying of this message is strictly prohibited. If you have received this communication in error, please notify us immediately by email and delete the original message.
You received this message because you are subscribed to the Google Groups "SAPABAP" group.